Soft (Gaussian CDE) regression models and loss functions



Jose Hernandez-Orallo (jorallo@dsic.upv.es) 
Departament de Sistemes Informatics i Computacio 
Universitat Politecnica de Valencia, Spain 



November 7, 2012 



Abstract 

Regression, unlike classification, has lacked a comprehensive and effective approach to deal with 
cost-sensitive problems by the reuse (and not a re-training) of general regression models. In this paper, a 
wide variety of cost-sensitive problems in regression (such as bids, asymmetric losses and rejection rules) 
can be solved effectively by a lightweight but powerful approach, consisting of: (1) the conversion of 
any traditional one-parameter crisp regression model into a two-parameter soft regression model, seen as 
a normal conditional density estimator, by the use of newly-introduced enrichment methods; and (2) the 
reframing of an enriched soft regression model to new contexts by an instance-dependent optimisation 
of the expected loss derived from the conditional normal distribution. 

Keywords: Cost-sensitive regression, asymmetric losses, Gaussian conditional density estimation
(CDE).
1 Introduction



Everyday applications of predictive models usually involve a full use of the available contextual
information. When the operating context changes, one may fine-tune the by-default (incontextual) prediction or
may even abstain from predicting a value (a reject). Consider a common case where a regression model 
has been built from some training data, and the model has to be deployed to new instances. If the context 
is the same for the new instances as it was for the training data, then the quality of the predictions will 
mostly depend on the observed quality of the model for the same context. However, if the context changes, 
the prediction given by the model may be suboptimal. For instance, if the model has been trained with a 
symmetric loss function but the deployment operating context involves an asymmetric loss function (where, 
e.g., underestimations have higher loss than overestimations), then predictions will need to be adjusted. In 
order to do this there are two options: (1) re-train or revise the model by using a possibly modified (e.g., 
oversampled) training data and the new loss function, or (2) use a reframing function which takes the model 
and the operating context and outputs a new reframed prediction. The first option is not always possible 
since many regression methods are not cost-sensitive or cannot be (easily) adapted to work with different 
(possibly complex) loss functions. Also, in the cases where the first option is possible, the training data must 
be preserved indefinitely and an important computational cost is incurred to retrain the model all over again. 
This is especially the case whenever the operating context changes recurrently, even for two consecutive
individual predictions. 

This kind of general problem has been studied extensively for classification, where the notion of operating
context (or condition) is common and well understood. Some of the techniques and notions for addressing 
these cases are cost matrices, cost-sensitive classification [19], ROC analysis [61, 22, 32], threshold-choice 
methods [42], calibration [14, 3, 5] and, of course, the notions of soft classifiers (outputting a score or prob- 
ability) versus the notion of crisp classifiers (just outputting a label). Certainly, there have also been a few 
efforts to find the parallel of these techniques for regression. However, most of them rely on a crisp view of 
the regression model, i.e., they work with regression models which just output a value. Examples of this are 
the Regression Error Curves [7], utility-based regression [63, 65], the definition of ranking measures [58] 
and the use of transformation functions for regression which derive a global reframing that must be con- 
stantly (or polynomially) applied to the output of the regression model [1, 72]. None of these approaches 
represents the right mapping between classification and regression. Whenever we consider a scoring
classifier (or a ranker) in classification, which can sort its predictions by their reliability (at least in the binary
case), we should consider a regression model which can sort its predictions by their reliability. Whenever
we consider a probabilistic classifier, which in fact outputs a discrete distribution on the labels (a categorical 
distribution), we should consider a regression model which outputs a continuous distribution (e.g., a normal 
distribution), and not a single value. This correspondence is shown in Table 1. 

From this correspondence, we see that the natural way of addressing context-sensitive problems in re- 
gression is the use of soft regression models (as soft classification models are the natural way of addressing 
context-sensitive problems). We need regression techniques which not only output the estimated expected 
value for each instance x, i.e. E(y|x) (also referred to as the conditional mean), but also accompany these
predictions with an estimated error, reliability or density function. There are many approaches for this. 
One approach is to obtain the standard error for each prediction as calculated by each specific technique 
(e.g., linear regression) if the algorithm provides a way to obtain this value for each prediction (which is 
not always the case). A second approach is to estimate the "reliability of individual regression predictions" 
[8], through sensitivity analysis, local averaging or other techniques, which can be applied to any regression 
method, as shown in [9]. A third approach is conformal prediction [60, 50, 49, 51], or any other method 
which derives a confidence interval. Finally, we can, of course, use conditional density (or distribution) esti- 
mation methods [57, 44, 40], which can derive the conditional probability density function of the dependent 
variable y, i.e. f(y|x), by using kernel or distribution mixtures. It has been recently said that "conditional
density estimation has been studied extensively in economics and Bayesian statistics [..., but] it has received
only little attention in the machine learning literature" [11]. One reason might be that conditional density
estimation is not easy to apply for many regression methods.

                Classification                        Regression
Crisp           a class label c(x)                    a numerical value m(x)
Soft            a score for each class                a numerical value m(x) and a
                                                      reliability measure (e.g.,
                                                      confidence interval)
Probabilistic   a categorical distribution            a continuous distribution
                (characterised by a conditional       (characterised by a conditional
                probability function p(y|x))          density function f(y|x))

Table 1: Correspondence between different types of classification and regression. Evaluation also depends
on the kind of prediction. For instance, crisp prediction implies the comparison of the estimated output
with the actual output, while probabilistic prediction implies the comparison of discrete distributions in
classification (the estimated p(y|x) with the true p(y|x)) and the comparison of continuous distributions in
regression (the estimated f(y|x) with the true f(y|x)).

However, given these approaches for soft regression, none of them has been generally applied for 
context-sensitive problems, because either these proposals are inappropriate, or are much too complex. 
For instance, standard errors, reliability metrics and confidence intervals are useful to rank the predictions 
according to their reliability, or to address some tolerance issues, but they cannot be used to get a precise 
quantifiable magnitude of what the expected loss will be for an instance and a specific operating context. 
On the other hand, conditional density estimation looks like the appropriate setting for this, since we can 
(theoretically) calculate the expected loss (i.e., the risk) as an integral over all the possible values for the 
dependent variable, weighted by its estimated density. The problem is that minimising this expected loss
is not easy, since the estimated density function may be non-monotonic, non-convex or not even
continuous.

In this paper we propose a simple approach for soft regression. In most cases, it is just sufficient to have a 
good estimation of a conditional normal (i.e., Gaussian) density function. This has several advantages. First, 
a normal distribution only needs two parameters, the mean (expected value) and the variance. This makes 
it possible to estimate these two parameters easily. It can be done from the regression methods themselves, 
from any estimation of the standard error, from a confidence interval and, of course, from any other more 
complex density function or mixture thereof. Second, the variance can be used to rank predictions in a very 
straightforward way, as is done with reliabilities, but with a clear interpretation of the magnitudes. Third, 
and most importantly, we can work analytically with the normal distribution and smoothly derive the exact 
expression leading to the output that minimises the expected loss for many common loss functions. 

In fact, we will see that there are extremely simple methods to estimate this variance which can be 
applied to any crisp regression technique. Some of these methods are just based on comparing the prediction 
for the training dataset with the actual value, disregarding the input domain. In this sense, these methods are 
closely related to calibration methods in classification. However, we call them 'enrichment' methods since 
they preserve the original prediction mean, while only adding a second parameter, the variance, to form a 
more powerful and flexible soft regression model for context-sensitive applications. 
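
As a rough illustration of the simplest kind of enrichment mentioned above, the following Python sketch attaches one global standard deviation to a crisp model, computed only from the residuals on the training data. The names `enrich_with_constant_sd` and `crisp`, as well as the toy data, are hypothetical, and the constant-variance estimator is an assumption made for illustration; it is not necessarily one of the enrichment methods analysed in Section 3.

```python
import math
from typing import Callable, Sequence, Tuple


def enrich_with_constant_sd(
    crisp_model: Callable[[float], float],
    xs: Sequence[float],
    ys: Sequence[float],
) -> Callable[[float], Tuple[float, float]]:
    """Turn a crisp model into a soft one that returns (mean, sd).

    The mean is the original prediction, left untouched. The standard
    deviation is the root mean squared residual on the training data, so
    it is the same for every input (assumption: constant variance).
    """
    residuals = [y - crisp_model(x) for x, y in zip(xs, ys)]
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))

    def soft_model(x: float) -> Tuple[float, float]:
        return crisp_model(x), sd

    return soft_model


def crisp(x: float) -> float:
    # Hypothetical crisp regression model.
    return 2.0 * x


# Toy training data; the residuals are +1, -1, +1, -1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 3.0, 7.0, 7.0]

soft = enrich_with_constant_sd(crisp, train_x, train_y)
mean, sd = soft(5.0)
print(mean, sd)
```

The point of the sketch is only that the first parameter of the soft model is inherited unchanged from the crisp model, while the second is estimated afterwards.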

Many common applications of regression where deployment contexts can change are then solved by this 
setting: cost-sensitive applications where we have asymmetric losses, screening applications where we need 
rejection rules to determine the examples for which no prediction will be issued, auction and retailing bids 
where prices (or other continuous variables) are chosen to obtain the maximum expected profit, situations 
where we want to derive the probability that two or more predictions are in the right order, etc.

In what follows, we analyse and derive the solutions for many of these problem families, using two- 
parameter regression models and deriving the optimal prediction for the corresponding loss function. For 
each of these families, we perform a complete set of experiments which show that our general approach is 
not worse than some specific solutions in the literature for some of these families, and is clearly better for 
others. As a result, the setting and methodology we introduce in this paper can be effectively (and easily) 
applied to a wide range of context-sensitive problems. 

In brief, the goal of the paper is to show that two-parameter regression (as an estimation of conditional 
mean and variance or, more easily, as an enrichment of a crisp model by simple conditional variance es- 
timations), followed by a probabilistic reframing assuming a normal distribution, is a simple, general and 
powerful method which can successfully address many kinds of problems. 

The paper is organised as follows. Section 2 introduces some notation about regression models and loss 
functions, as well as relevant previous work which triggers the introduction of the general notion of refram- 
ing. We briefly define different types of reframing and the optimal prediction expressions for probabilistic 
reframing. From here, the objective of the paper is re-stated in a more precise way and the experimental 
methodology is settled for the rest of the paper. Section 3 analyses and compares several conditional density 
estimation methods and other methods for soft regression, as well as methods which can be used for the 
enrichment of a crisp model by the use of univariate conditional variance estimation methods. Since there 
are many possible approaches, this analysis is necessary for choosing only a few good, simple methods 
for the following sections. Section 4 formally derives the optimal decision rules for the two loss functions
representing bids in auctions, sales or other trading scenarios, assuming a conditional normal distribution. 
We perform a complete set of experiments using probabilistic reframing with different enrichment methods 
and compare them against a global reframing method based on a constant shift over the training set. Section 
5 performs a similar procedure for asymmetric losses: absolute (linear) and squared (quadratic). We also 
compare to two global reframing methods, one based on a constant shift and another based on a polyno- 
mial shift. Section 6 also introduces loss functions representing the situation where we have rejection rules 
in regression. Similarly, we derive and reuse the expressions for optimal reframing and compare several 
approaches. Section 7 makes a comprehensive analysis of results, suggests many other applications which 
could be modelled as loss functions and closes the paper with a summary of the contributions and the work 
ahead. Several appendices (which can be skipped on a first reading) complete the paper with some additional 
information for the datasets and metrics used in the experiments, more detailed results for some techniques 
that have been dismissed and some proofs. 

2 Background 

We start with some basic definitions and notation, followed by some related work. Then we introduce the 
key notion of reframing (and the distinction between global and local reframing). 

2.1 Regression and conditional densities 

Let us consider a multivariate input (or predictor) domain $X \subset \mathbb{R}^d$ and a univariate output (or response)
domain $Y \subset \mathbb{R}$. The domain space D is then $X \times Y$. Labelled examples or instances are just pairs $(x, y) \in D$,
and datasets are subsets of D. Unlabelled examples are elements $x \in X$, sometimes represented as $(x, ?)$.
We denote by $D_X$ and $D_Y$ the projection of D for the input domain and output domain respectively. A crisp
regression model $\hat{m}$ is a function $\hat{m} : X \to Y$. A soft regression model accompanies each prediction with
a reliability, confidence or, more generally, a conditional probability density function $\hat{f}(y|x)$ with $y \in Y$
and $x \in X$. The corresponding cumulative distribution function is
$\hat{F}(y|x) = \int_{-\infty}^{y} \hat{f}(t|x)\,dt = p(Y \le y \mid x)$, i.e.
the probability of the output being lower than or equal to y for an input x. The estimated expected value
(conditional mean) is denoted by $\hat{\mu}_{\hat{f}}(x) = E_{\hat{f}}(y|x) = \int_{-\infty}^{\infty} y \hat{f}(y|x)\,dy$. We denote its (conditional) standard
deviation by $\hat{\sigma}_{\hat{f}}(x)$. We will drop the subindices when clear from the context. Note that the mean and the
standard deviation are conditional, i.e., defined for one single example; these are not the mean and standard
deviation of a distribution of examples (or a whole dataset).

We can derive a crisp regression model from a conditional density function $\hat{f}(y|x)$, as $\hat{m}(x) = \hat{\mu}_{\hat{f}}(x)$. The
target function will be represented by a true density function $f(y|x)$. If the target function is deterministic, it
can be represented with a Dirac delta function (all the density mass falls over the true single value) or more
simply, as a deterministic function $m : X \to Y$.

The normal distribution will be represented as usual, $\mathcal{N}(\mu, \sigma^2)$, with probability density function $\phi_{\mu,\sigma^2}(\cdot)$
and cumulative distribution function $\Phi_{\mu,\sigma^2}(\cdot)$. For the standard probability density function and the standard
cumulative distribution function we will drop the subindices, and we will write $\phi(\cdot)$ and $\Phi(\cdot)$ respectively.

2.2 Cost-sensitive problems and loss functions 

In context-sensitive learning, there are several features which describe a context, such as the data distribu- 
tion, the costs of using some input variables and the loss of the errors over the output variables. In this 
paper, we focus on loss functions over the output. As we will see, by properly defining the loss function and 
its parameters we can analyse and address many problem families. Let us start with the definition of loss 
function: 

Definition 1. A loss function is any function $\ell : Y \times Y \to \mathbb{R}$ which compares elements in the output domain.
For convenience, the first argument will be the estimated value, and the second argument the actual value,
so its application is usually denoted by $\ell(\hat{y}, y)$.

Typical examples of loss functions are the absolute error ($\ell^A$) and the squared error ($\ell^S$), with
$\ell^A(\hat{y}, y) = |\hat{y} - y|$ and $\ell^S(\hat{y}, y) = (\hat{y} - y)^2$. These two loss functions are symmetric, i.e. for every y and r
we have that $\ell(y + r, y) = \ell(y - r, y)$. They are also commutative, i.e., for every $y_1$ and $y_2$ we have that
$\ell(y_1, y_2) = \ell(y_2, y_1)$.

While many methods use these generic loss functions (such as $\ell^S$), most applications do have different
loss functions. For instance, the bounded absolute error ($\ell^{BA}_{\beta}$) is defined as $\ell^{BA}_{\beta}(\hat{y}, y) = \min(|\hat{y} - y|, \beta)$,
which is also symmetric and commutative. Another example is the bid loss function $\ell^{B}_{\beta}(\hat{y}, y) = -\hat{y} + \beta$ if
$\hat{y} < y$ and $\beta$ otherwise, which is clearly asymmetric. In practice, there can be specialised loss functions for
virtually any application domain.
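
The symmetry and commutativity properties above can be spot-checked in a few lines of Python. This is only a sketch: the helper names are hypothetical, the asymmetric linear loss and its weights are an illustrative assumption (it is not one of the losses defined in this section), and checking a single pair of values is a sanity check rather than a proof.

```python
from typing import Callable

# A loss takes (estimated value, actual value), as in Definition 1.
Loss = Callable[[float, float], float]


def absolute_error(y_hat: float, y: float) -> float:
    return abs(y_hat - y)


def squared_error(y_hat: float, y: float) -> float:
    return (y_hat - y) ** 2


def bounded_absolute_error(beta: float) -> Loss:
    """Absolute error capped at beta."""
    return lambda y_hat, y: min(abs(y_hat - y), beta)


def asymmetric_linear(y_hat: float, y: float) -> float:
    # Hypothetical weights: underestimating costs 3 per unit, overestimating 1.
    return 3.0 * (y - y_hat) if y_hat < y else (y_hat - y)


def is_symmetric(loss: Loss, y: float, r: float) -> bool:
    """Check loss(y + r, y) == loss(y - r, y) for one pair (y, r)."""
    return loss(y + r, y) == loss(y - r, y)


def is_commutative(loss: Loss, y1: float, y2: float) -> bool:
    """Check loss(y1, y2) == loss(y2, y1) for one pair (y1, y2)."""
    return loss(y1, y2) == loss(y2, y1)


losses = {
    "absolute": absolute_error,
    "squared": squared_error,
    "bounded absolute (beta=2)": bounded_absolute_error(2.0),
    "asymmetric linear": asymmetric_linear,
}
for name, loss in losses.items():
    print(name, is_symmetric(loss, 10.0, 3.0), is_commutative(loss, 10.0, 14.0))
```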

While some previous works in the literature of regression techniques have focussed on re-designing the 
learning technique to account for specific loss functions during training ([15, 46]), only a few have con- 
sidered the problem as a post-hoc process, once the model has been learnt. A post-hoc process can be 
performed in cases where re-training with the new loss functions is not possible (because the regression
technique is not cost-sensitive or because the training data is no longer available). It also has several
advantages, such as model reuse and the possibility of applying the same methods to virtually any regression
technique. This post-hoc process can be traced back to the seminal work by Granger [37], showing that the 
optimal predictor for some asymmetric losses can be expressed as the conditional mean plus a constant bias 
term [38]. However, it is recognised that solving this term is not always easy (or even possible in closed 
form) for many loss functions and density functions. Specific results have been studied for some particular 
loss functions, such as Lin-Exp (approximately linear on one side and exponential on the other side) and 
Quad-Exp (approximately quadratic on one side and exponential on the other side), which have general 
solutions under mild conditions [71]. Conversely, closed-form solutions for Lin-Lin (asymmetric
linear) and Quad-Quad (asymmetric quadratic) do not exist in general [12, 13]. In fact, even general
non-closed-form expressions are not always possible unless some constraints are imposed, such as continuous
loss functions, finite expected loss and particular properties on the moments of the density function [20]. 
In general, much of this work is restricted to continuous loss functions in time series or system reliability
applications [2, 62], but it provides sufficient evidence that working with complex density functions is very
problematic for general (and possibly discontinuous or non-convex) loss functions. This has motivated the 
appearance of other approaches which do not use any density estimation, such as the calculation of a global 
function which is applied to the outputs [1, 72]. In this case, the restrictions come on the side of the loss 
function, which must be convex, and the requirement that the training set (technically, only the true values 
y) must be preserved from the training to the deployment stage. 

However, some other problems are not usually considered, not even as generalised loss functions [38]. 
Examples of these context-sensitive problems are rejection rules, where we want the model to abstain
from outputting a prediction for the most unreliable cases. None of the previous approaches has addressed 
this problem. In fact, rejection rules are common in classification [28, 52], but rarely seen as cost-sensitive 
problems in regression. However, as we will see, these problems can be modelled with a loss function which 
sets a cost of rejection. Also, many regression problems used for product prescription, sale predictions and 
auction bids look at finding the bid price, i.e., the appropriate quantity (or other negotiable feature) which 
has the maximum expected benefit [4]. Many of these problems can also be modelled by a loss function 
and, yet again, the predictions of the model can be fine-tuned for them. 

As a result of this variety and diversity of problem families which can be modelled with loss functions 
or other kinds of context information, we can integrate and generalise some of the existing (and new) model
adaptation procedures into a more general term that we call reframing. 

2.3 Reframing and optimal predictions 

Given a loss function representing a particular context, the objective is to get predictions with low loss rather 
than predictions with low error. In order to do this, we do not train a model using this loss function (as risk 
minimisation approaches could do), because the loss function may not be known at the training stage or 
the regression technique may not be able to process loss information. Even if possible, re-training a model 
whenever the context changes is not a very efficient approach in terms of resources. It may also be inefficient 
in terms of reliability if the application requires stable, validated models. As an alternative, we propose the 
use of reframing functions, which adapt the predictions of the original model to the context, represented by 
a loss function. 

Definition 2. A reframing function is any method which produces a predicted output value given the input
value x, the loss $\ell$ and the model $\hat{f}$:

$$r(x, \ell, \hat{f}) \to \hat{y} \tag{1}$$

where $\hat{y}$ represents the reframed output.

Figure 1 shows the process of reframing graphically. Note that we do not impose any restriction on how 
the training data is obtained from the training context. Also, we do not assume that this data generation 
process has to be similar to the process generating the unlabelled data from the deployment context. In fact, 
the distributions of predictor x and response y will usually differ between contexts, as we will reflect in the 
experimental setting. 

For those crisp regression methods not using a density, we can assume a Dirac delta function $\hat{f}_{\hat{m}}$, whose
mean is clearly the prediction point given by the model, i.e., $\hat{m}(x) = \hat{\mu}_{\hat{f}}(x)$, or we can define reframing
methods which do not need a density, by expressing them as a function which only depends on the loss
function and the expected mean for each example, as follows:

$$r(x, \ell, \hat{f}) = R(\ell, \hat{\mu}_{\hat{f}}(x))$$
[Figure 1: diagram with the labels: training context, training data <x,y>, learning, model f(y|x),
deployment data, deployment context, context-reframed prediction.]

Figure 1: Reframing process adapting predictions from one context to a different context.



Example 1. For instance, given the bid loss function $\ell^{B}_{\beta}$, with $\beta = 0$, we might consider the following global
reframing function:

$$R_1(\ell^{B}_{\beta}, \hat{y}) = 0.8 \times \hat{y}$$

which just systematically reduces predictions by 20%. The rationale here is given by the fact that
overestimations imply a loss (no deal) and underestimations always imply a benefit (there is a deal, and we
assume that $\hat{y}$ always represents positive prices, so giving a negative loss). So, a 20% reduction creates some
margin which can produce higher overall benefit.

Alternatively, a local (probabilistic) reframing could be done using $\hat{f}$, if available. For instance, using
the same bid loss function $\ell^{B}_{\beta}$ with $\beta = 0$, a probabilistic reframing might be:

$$r_2(x, \ell^{B}_{\beta}, \hat{f}) = \hat{F}^{-1}(0.25)$$

where $\hat{F}^{-1}$ is the quantile function for $\hat{f}$, i.e., the inverse of the cumulative distribution function $\hat{F}$. This
means that we predict the value such that 25% of the probability mass for y is below that value.

Figure 2 shows the use of $R_1$ and $r_2$ above for two different instances.
In general, we can distinguish four kinds of reframing: 

• Constant global reframing: all predictions are modified in the same way independently of $\hat{y}$, e.g.
adding a constant s (i.e., $\hat{y} \leftarrow \hat{y} + s$). This constant s is called the shift.

• Non-constant global reframing: predictions are modified using a (e.g. polynomial) function of $\hat{y}$.
While the shift is different for each example, it only depends on the prediction, and it can be considered
a 'global' method, since it applies a global function.

• Non-probabilistic local reframing: predictions are modified by a transformation of $\hat{y}$ using some
reliability or confidence parameters. For instance, we could define a reframing method which only
modifies (or rejects) the instances that are below a given reliability threshold or above a percentage of
the confidence width.

• Probabilistic local reframing: the outputs are adapted according to a transformation over the
conditional density function. If $\hat{f}$ is a parametric distribution, we can just use the parameters as arguments
for the transformation. For instance, if $\hat{f}$ is a normal distribution, then we can just define the reframing
transformation in terms of the mean $\hat{y} = \hat{\mu}_{\hat{f}}(x)$ and the conditional standard deviation $\hat{\sigma}_{\hat{f}}(x)$.
Figure 2: Representation of how predictions are shifted according to different reframing methods. The
top row corresponds to an instance whose expected value is 16.28. Without reframing, this would be the
output value. The top-left plot shows how the prediction moves to 13.02 (applying the global reframing $R_1$,
which multiplies the prediction by 0.8). The top-right plot shows a soft regression model with a normal
conditional density function $\hat{f}$ which gives a mean at 16.28 and standard deviation of 5.37 for this instance.
The prediction moves to 12.66 (applying the local reframing $r_2$, using $\hat{F}^{-1}(0.25)$). The bottom row shows
a similar picture for a different instance where the expected value is 21.83. The soft regression model gives
a standard deviation of 3.85 for this example. The reframing $R_1$ leads to 17.46 (bottom left) while the
reframing $r_2$ leads to 19.23. As we see, the shift for the first instance (top) is greater for the local method
(right), while the shift for the second instance (bottom) is greater for the global method (left).
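
The four reframed values quoted in the caption of Figure 2 can be reproduced with a short Python sketch. It uses the standard library's `statistics.NormalDist` for the normal quantile function; the function names are hypothetical, and the means and standard deviations are the ones stated in the caption.

```python
from statistics import NormalDist


def global_reframing(y_hat: float) -> float:
    """Global reframing of Example 1: multiply every prediction by 0.8."""
    return 0.8 * y_hat


def local_reframing(mean: float, sd: float) -> float:
    """Local reframing of Example 1: the 0.25 quantile of a normal density."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(0.25)


# (conditional mean, conditional standard deviation) for the two instances.
instances = [(16.28, 5.37), (21.83, 3.85)]

for mean, sd in instances:
    g = round(global_reframing(mean), 2)
    l = round(local_reframing(mean, sd), 2)
    print(f"mean={mean}  global={g}  local={l}")
# mean=16.28  global=13.02  local=12.66
# mean=21.83  global=17.46  local=19.23
```

The global shift is proportional to the prediction alone, whereas the local shift is proportional to the standard deviation, which is why the two methods swap order between the two instances.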
The notion of shift (and reframing in general) for regression is closely related to similar procedures which
are usually performed in classification. For instance, when we have a crisp classifier, we can (randomly)
tweak some of the predictions, in order to get a more balanced set of predictions or to favour one class
against others according to a cost matrix. This view would correspond to the global reframing methods
above. For soft classifiers, the choice of an appropriate threshold to convert scores into predictions would 
correspond to local reframing. In particular, when we work with calibrated classifiers and we try to find the 
optimal thresholds (as in [48]) using a probabilistic setting, we have a scenario that is parallel to probabilistic 
local reframing. 

In regression, approaches to global reframing rely on the calculation of a constant (or function) from 
the training set. For instance, one simple method for global constant reframing is the calculation of the best 
constant shift for the training set given a loss function. One problem of these methods is that we do not 
always have the training (or a validation) dataset. Or even if we can have the dataset, it might be costly to 
keep it. Also, in many applications the loss function parameters may be different for each instance and the 
shift needs to be recalculated by exploring the whole training dataset all over again. Finally, this reframing 
may be problematic when the output distribution differs between the training dataset and the deployment 
dataset, because the global function is optimised for the training dataset. 
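
A minimal sketch of constant global reframing, assuming the training predictions and true values are still available: the shift is picked from a grid of candidates as the one with the lowest mean loss on the training set. The function names, the toy data, the candidate grid and the asymmetric linear loss are all illustrative assumptions, not the procedure used in the experiments.

```python
from typing import Callable, Sequence


def asymmetric_linear(y_hat: float, y: float) -> float:
    # Hypothetical weights: underestimating costs 3 per unit, overestimating 1.
    return 3.0 * (y - y_hat) if y_hat < y else (y_hat - y)


def mean_loss(
    shift: float,
    predictions: Sequence[float],
    actuals: Sequence[float],
    loss: Callable[[float, float], float],
) -> float:
    """Average loss on the training set after adding `shift` to every prediction."""
    total = sum(loss(p + shift, y) for p, y in zip(predictions, actuals))
    return total / len(actuals)


def best_constant_shift(
    predictions: Sequence[float],
    actuals: Sequence[float],
    loss: Callable[[float, float], float],
    candidates: Sequence[float],
) -> float:
    """Return the candidate shift with the lowest mean training loss."""
    return min(candidates, key=lambda s: mean_loss(s, predictions, actuals, loss))


# Toy training predictions and true values (hypothetical).
train_pred = [10.0, 12.0, 15.0, 20.0, 22.0]
train_true = [11.0, 11.0, 17.0, 19.0, 25.0]
candidates = [i / 10 for i in range(-50, 51)]  # shifts from -5.0 to 5.0

shift = best_constant_shift(train_pred, train_true, asymmetric_linear, candidates)
print(shift)
print(mean_loss(0.0, train_pred, train_true, asymmetric_linear))
print(mean_loss(shift, train_pred, train_true, asymmetric_linear))
```

The sketch also makes the drawbacks listed above concrete: every change in the loss parameters requires another pass over the stored training data, and the chosen shift is tied to the training output distribution.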

On the contrary, local reframing does not have the above-mentioned problems, but requires, in the 
probabilistic case, an accurate conditional density estimation and a method to derive the optimal reframing 
in an analytical or numerical way. In what follows, we will focus on the notion of 'optimal reframing' in 
the probabilistic (local) case, since the notion of optimality for the non-probabilistic (global) case is more 
elusive (since reliability or confidence measures cannot be used, in general, to quantify the expected loss). 

2.4 Optimal probabilistic reframing 

It seems reasonable to think that better decisions can be made if we take the conditional density function
$\hat{f}(y|x)$ into account, rather than just the expected value $E_{\hat{f}}(y|x)$, provided this density function is
well estimated.

The maximum density reframing is given by $r^{max}(x, \ell, \hat{f}) = \arg\max_{y} \hat{f}(y|x)$, which ignores the loss
function, and just gives the point with maximum density. The mean reframing is given by
$r^{mean}(x, \ell, \hat{f}) = E_{\hat{f}}(y|x) = \int_{-\infty}^{\infty} y \hat{f}(y|x)\,dy = \hat{\mu}_{\hat{f}}(x)$, which also ignores the loss function. For some density functions, e.g., a
normal distribution, we have $r^{max} = r^{mean}$.

In general, we want to take the loss function $\ell$ into account. For an unlabelled instance $(x, ?)$, the
expected loss (risk function) for prediction t is given by:

$$\int_{-\infty}^{\infty} \ell(t, y) \hat{f}(y|x)\,dy \tag{2}$$

Then, the prediction with minimum expected loss is calculated by the following reframing function:

$$r^{*}(x, \ell, \hat{f}) = \arg\min_{t} \int_{-\infty}^{\infty} \ell(t, y) \hat{f}(y|x)\,dy \tag{3}$$

This equation says that the optimal prediction (ignoring the uncertainty of the estimation of $\hat{f}$) for each
example x depends on its estimated distribution and the loss function. Interestingly, the previous equation
for $r^{*}$ is independent of the data (marginal) distribution f(x), which means that it can be applied to each
individual instance without considering the rest. This is important, since the loss function $\ell$ may even vary
for different instances.
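
To make the instance-wise minimisation of the expected loss concrete, the sketch below approximates the integral of eq. (2) with the trapezoidal rule and searches a grid of candidate predictions for the minimiser of eq. (3). It assumes a normal conditional density; the function names, grid sizes and the asymmetric linear loss are illustrative assumptions. For this particular loss the minimiser is known to be the quantile at level 3/(3+1) = 0.75, which gives a simple check of the numerical result.

```python
from statistics import NormalDist
from typing import Callable


def asymmetric_linear(y_hat: float, y: float) -> float:
    # Hypothetical weights: underestimating costs 3 per unit, overestimating 1.
    return 3.0 * (y - y_hat) if y_hat < y else (y_hat - y)


def expected_loss(
    t: float,
    loss: Callable[[float, float], float],
    dist: NormalDist,
    steps: int = 2000,
) -> float:
    """Trapezoidal approximation of eq. (2) on [mean - 8 sd, mean + 8 sd]."""
    lo = dist.mean - 8.0 * dist.stdev
    h = 16.0 * dist.stdev / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * loss(t, y) * dist.pdf(y)
    return total * h


def optimal_prediction(
    loss: Callable[[float, float], float],
    dist: NormalDist,
    n_candidates: int = 301,
) -> float:
    """Grid search for eq. (3) over candidates in [mean - 3 sd, mean + 3 sd]."""
    lo = dist.mean - 3.0 * dist.stdev
    step = 6.0 * dist.stdev / (n_candidates - 1)
    candidates = [lo + i * step for i in range(n_candidates)]
    return min(candidates, key=lambda t: expected_loss(t, loss, dist))


# One instance with conditional mean 16.28 and standard deviation 5.37.
dist = NormalDist(mu=16.28, sigma=5.37)
t_star = optimal_prediction(asymmetric_linear, dist)
print(t_star, dist.inv_cdf(0.75))
```

Only the density of the instance at hand enters the computation, so a different loss (or different loss parameters) can be used for the next instance at no extra cost.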

The question, now, is how to solve eq. (3). In some cases, if $\ell$ and $\hat{f}$ satisfy some properties, we can
easily solve the equation. For instance, the following proposition gives the result for the easiest (well-known)
case (the proof is in appendix H):
Proposition 1. If $\hat{f}$ is symmetric¹ and $\ell$ is symmetric and commutative then

$$r^{*}(x, \ell, \hat{f}) = \int_{-\infty}^{\infty} y \hat{f}(y|x)\,dy = r^{mean}(x, \ell, \hat{f})$$

which states that the optimal prediction is given by the mean of the conditional density function.
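
As a quick check of the proposition in one familiar case, take the squared error and a conditional density with mean $\hat{\mu}$ and finite variance $\hat{\sigma}^2$ (subindices dropped). The decomposition below is the standard textbook identity and is offered only as an illustration; it is not the proof given in appendix H.

```latex
\begin{align*}
\int_{-\infty}^{\infty} (t - y)^2 \hat{f}(y|x)\,dy
  &= \int_{-\infty}^{\infty} \bigl((t - \hat{\mu}) + (\hat{\mu} - y)\bigr)^2 \hat{f}(y|x)\,dy \\
  &= (t - \hat{\mu})^2
     + 2 (t - \hat{\mu}) \int_{-\infty}^{\infty} (\hat{\mu} - y) \hat{f}(y|x)\,dy
     + \int_{-\infty}^{\infty} (\hat{\mu} - y)^2 \hat{f}(y|x)\,dy \\
  &= (t - \hat{\mu})^2 + \hat{\sigma}^2
\end{align*}
```

The middle integral vanishes because $\hat{\mu}$ is the mean of the density, so the expected loss is minimised at $t = \hat{\mu}$, where it equals $\hat{\sigma}^2$.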

But in many applications, $\ell$ is not symmetric. Also, for many density estimation methods, $\hat{f}$ is not
symmetric either. Only if $\hat{f}$ is chosen to be a simple distribution (e.g., a normal distribution) can the equation
be solved analytically for specific, asymmetric loss functions, as we will see in the following sections.
In the general case, however, the best prediction cannot be calculated in an analytical way and needs to be
obtained by a numerical method, such as a Monte Carlo method, or any other method (e.g., hill-climbing)
which can exploit the properties of particular cases for $\ell$ and $\hat{f}$, such as (partial) monotonicity or convexity.
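
The numerical route can be sketched with a Monte Carlo approximation: draw samples from the estimated density, replace the integral of eq. (3) by a sample average and search over candidate predictions. The two-component normal mixture, the loss weights, the seed, the candidate grid and the function names are all illustrative assumptions; they are not taken from the experiments of this paper.

```python
import random
from typing import Callable, List, Sequence


def sample_mixture(rng: random.Random, n: int) -> List[float]:
    """Draw n values from a hypothetical, non-symmetric two-component normal mixture."""
    samples = []
    for _ in range(n):
        if rng.random() < 0.7:
            samples.append(rng.gauss(10.0, 1.0))
        else:
            samples.append(rng.gauss(16.0, 2.0))
    return samples


def asymmetric_linear(y_hat: float, y: float) -> float:
    # Hypothetical weights: underestimating costs 3 per unit, overestimating 1.
    return 3.0 * (y - y_hat) if y_hat < y else (y_hat - y)


def monte_carlo_optimum(
    samples: Sequence[float],
    loss: Callable[[float, float], float],
    candidates: Sequence[float],
) -> float:
    """Return the candidate with the lowest average loss over the samples."""

    def risk(t: float) -> float:
        return sum(loss(t, y) for y in samples) / len(samples)

    return min(candidates, key=risk)


rng = random.Random(0)  # fixed seed so that the run is repeatable
samples = sample_mixture(rng, 4000)
candidates = [5.0 + 0.1 * i for i in range(201)]  # 5.0, 5.1, ..., 25.0

t_star = monte_carlo_optimum(samples, asymmetric_linear, candidates)
sample_mean = sum(samples) / len(samples)
print(t_star, sample_mean)
```

With an asymmetric loss and an asymmetric density, the minimiser found this way lies well above the sample mean, so the mean reframing would be suboptimal here.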

2.5 Goals and experimental design 

Once we have the ingredients, terminology and concepts, we can state our research goal more precisely.
Namely, in the rest of this paper, we will answer several questions. First, are there general and practical 
conditional density estimation methods which can be used effectively to reframe an existing crisp regression 
model? In order to answer this question we need to explore techniques which can produce conditional 
density estimations for any crisp regression technique. We will focus on normal density estimations, because 
only two parameters are needed, mean and variance, and the former is already given by any regression model. 
This implies that we will be able to derive the reframing transformation relatively smoothly for most loss 
functions. Consequently, the following section will be devoted to the experimental analysis of several old 
and new approaches to normal (Gaussian) conditional density estimations. From this analysis we will be 
able to select some methods that will be used in subsequent sections. 

The second, ultimate, question of this paper is whether probabilistic reframing methods based on these
simple estimators are able to solve a broad set of cost-sensitive problem families, including bidding problems,
asymmetric loss functions and rejection rules. We will devote a section to each of these families, where we
will derive the formal expressions for the reframing transformations and compare these methods
with previous specific methods in the literature that address each of these problem families. We will see
that probabilistic reframing is more general and effective.

Apart from the theoretical derivations of the reframing transformations, an important part of the upcoming
sections relies on experimental results. We will briefly describe the general experimental setting now,
and we will leave some other details for each specific section.

We will use forty datasets, as shown in tables 11 and 12 in appendix A. The first battery of datasets
will be used for the experiments in section 3. We will use the other battery for the experiments in sections 
4, 5 and 6. The reason for two different batteries is that we select the best conditional density estimation 
and enrichment methods in section 3 with the first battery, and we use these methods with a fresh and 
independent battery for the particular applications in the other sections. 

In all the experiments we will use 2-fold cross-validation without previously shuffling the datasets (i.e.,
the order is preserved before splitting them). This configuration tries to mimic a realistic situation where
the training and test distributions may differ. Note that increasing the number of folds or shuffling the
datasets would yield similar distributions for training and test (in terms of the predictors x and response
y), which is quite an uncommon scenario in practice (although relatively usual in machine learning research
experiments). It is important to highlight that it is not our goal to obtain an estimation of how well each
method will perform for each dataset under exactly the same distribution (where usual re-sampling methods
such as 10-fold cross-validation would be appropriate), but to compare several methods in a realistic setting
where the context between training and test can change, including the data distribution. In order to illustrate
how training and test distributions differ, two extra columns (TrTeMD, TrTeKS) in the dataset tables 11 and
12 show the train-test relative difference of means and the train-test Kolmogorov-Smirnov
statistic, respectively. Both are averaged over the two folds. The higher these values are, the more
dissimilar the training and test distributions are.

¹ The notion of symmetry for the loss function has been defined above. The notion of symmetry for a distribution is the classical
notion of symmetry relative to the mean.
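The evaluation protocol described above can be sketched as follows, assuming a dataset stored as an ordered list. The helper names `two_fold_unshuffled` and `ks_statistic` are hypothetical, and the toy response values are invented for illustration only.

```python
def two_fold_unshuffled(n):
    """Index pairs (train, test) for 2-fold CV that preserves the original order."""
    half = n // 2
    first, second = list(range(half)), list(range(half, n))
    return [(first, second), (second, first)]

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        # Advance both samples past every value equal to v
        while i < len(a) and a[i] == v:
            i += 1
        while j < len(b) and b[j] == v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

y = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]  # toy response values, kept in order
folds = two_fold_unshuffled(len(y))
ks = [ks_statistic([y[i] for i in tr], [y[i] for i in te]) for tr, te in folds]
```

In this toy case the two halves do not overlap at all, so the statistic takes its maximum value in both folds; shuffling before splitting would bring it much closer to zero.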

In order to assess the significance of the experimental results we will use a custom procedure, following
[45] and [31, ch.12], which in turn is mostly based on [16]. Since we will not have any baseline method,
we will use a Friedman test to tell whether the difference between several methods is significant and then
we will apply the Nemenyi post-hoc test. We agree with [35] that the Nemenyi test is a "very conservative
procedure and many of the obvious differences may not be detected", but we prefer to be conservative
given our experimental setting and the use of a 0.95 confidence level. In some result tables we will show
the means (even though in many cases they are not commensurate) and in some other tables we will show
the average ranks (from which the Friedman and Nemenyi tests are calculated). We will also include the
critical difference for the Nemenyi test, so that the difference between two algorithms can be declared
significant whenever the difference between their average ranks is greater than the critical difference.
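A minimal sketch of the rank-based comparison is given below. It assumes the usual form of the Nemenyi critical difference, CD = q_alpha * sqrt(k(k+1)/(6N)) for k methods and N datasets, with commonly tabulated critical values for a 0.05 significance level; the table `Q_ALPHA_005`, the function names and the toy scores are assumptions and should be checked against a statistical reference before use.

```python
import math

# Commonly tabulated Nemenyi critical values q_0.05 for k = 2..6 methods
# (an assumption here; verify against a reference table).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def average_ranks(scores):
    """scores[d][m]: loss of method m on dataset d (lower is better). Ties share ranks."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda m: row[m])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of the tied 1-based ranks
            for m in order[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]

def nemenyi_cd(k, n_datasets, q_alpha=Q_ALPHA_005):
    """Critical difference CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

toy = [[0.81, 0.80, 0.85], [0.80, 0.75, 0.66], [0.48, 0.61, 0.62]]  # 3 toy datasets
ranks = average_ranks(toy)
cd = nemenyi_cd(k=3, n_datasets=20)
```

Two methods would then be reported as significantly different when the gap between their average ranks exceeds `cd`.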

3 Normal conditional density estimation (NCDE): enrichment methods 

A theoretically-optimal decision rule for a conditional density function and a loss function will only work if
the conditional density function $\hat f(y|x)$ is accurate. While there are many techniques for conditional density
estimation (CDE, see appendix C), they may be inappropriate for cost-sensitive scenarios. First, as they focus
on the whole conditional distribution, the estimated conditional mean given by these complex estimated
conditional density functions is usually worse than the conditional mean output by many crisp regression
methods. Second, CDE methods are usually slow. Third, in many problems, the actual conditional density
functions are not multi-modal, and even if they are, it is not clear that adjusting many parameters to approximate
this multi-modality will finally lead to the choice of a better (or even significantly different) optimal
prediction for many loss functions. Finally, some problems are deterministic and what we really want is an
estimation of the residual rather than (technically) a conditional density function for the output variable.

Instead of complex (usually non-parametric) CDE methods, one of the simplest, most common, parametric
density functions is given by the normal (Gaussian) distribution. Estimating a normal distribution
only requires the estimation of two parameters, the mean and the variance. It is important to clarify that
the use of a normal conditional density $\hat f(y|x) \sim \mathcal{N}$ does not entail, at all, that the output variable is
distributed normally ($f(y) \sim \mathcal{N}$). Moreover, the use of an estimated normal conditional density $\hat f(y|x)$ does
not even mean that we assume that the true conditional density $f(y|x)$ is normal. In fact, when having an
empirical dataset, we do not have information about the true conditional distribution; we just have examples
for which its actual distribution can be seen as a Dirac delta function. In other words, the use of a normal
conditional density function follows practical considerations and can be seen (at most) as a representation
of the model's belief about how its uncertainty is distributed, i.e., a model of the distribution of residuals.

Consequently, in this section we will explore and develop normal conditional density estimation methods,
or NCDE methods for short. This boils down to a soft regression model that, for every input instance
x, just outputs two parameters: $\hat\mu(x)$ and $\hat\sigma(x)$. The estimation of $\hat\mu(x)$ is the goal of all (crisp and soft)
regression methods. Consequently, we will focus below on the estimation of $\hat\sigma(x)$, comparing the results of
several methods. The goal of this section is not to find the best estimator for $\hat\sigma(x)$ as an isolated problem,
but to find simple and general methods that work well when the conditional mean $\hat\mu(x)$ is already given by
any crisp regression technique.

In order to perform the comparison, we need evaluation metrics for conditional density estimators. We
are interested in metrics that can evaluate (1) how good the conditional mean is, (2) how good the conditional
variance is, and (3) how good the conditional density is (which is given by the qualities of the mean and the
variance). For the conditional mean we will use the mean relative square error (mrse), a standardised version
of the square error. For the conditional variance we will use the mean standardised variance ratio (msvr),
which is a standardised metric of the ratio between the estimated variance and the squared residuals. Finally,
for the conditional density we will use the mean standardised likelihood (msll), a standardised version of the
log-likelihood. All these measures are standardised between 0 (best) and 1 (worst). The exact formulations
for these metrics can be found in appendix B.

3.1 Directly estimating the variance from the regression techniques 

The first way of obtaining the mean and variance for each prediction is choosing a base regression technique 
which directly or indirectly is able to provide the variance (or a measure of standard error). In this paper, we 
will work with three common base regression techniques: 

• Linear regression (LR): many implementations of linear regression can calculate the standard errors
for each predicted point, $se(x)$. If this is the case, we can just set $\hat\sigma(x) = se(x)$. The particular LR
method we will use is ordinary least squares using the function lm of R [55] with default parameters.

• Nearest neighbours (kNN): in this case the variance is calculated as the variance of the actual y values
for the k closest elements. In particular, we use an unweighted k-nearest neighbours algorithm using
the Euclidean distance (with all the attributes scaled by the function scale in R) with k = 10.

• Regression trees (Tree): in this case, one easy way of calculating the variance is to calculate the
variance of the actual y values (in the training set) for each leaf of the tree. Then, for each new
prediction on a new dataset, the variance will be given by the variance of the leaf where the example
falls. We use the CART algorithm [10] implemented by the function tree in the package tree in R
with its default parameters.
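The kNN and Tree variants above can be sketched in a few lines. This is only an illustration in Python rather than the R functions used in the paper; the function names are hypothetical, the inputs are assumed to be already scaled, and the use of the population variance (rather than the sample variance) is an assumption, since the text does not specify it.

```python
import math
from statistics import mean, pvariance

def knn_soft_predict(train_x, train_y, x, k=10):
    """Return (mean, variance) of the y values of the k nearest training points.

    train_x: list of feature vectors, assumed already scaled; x: query vector.
    """
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    neigh = [train_y[i] for i in order[:k]]
    return mean(neigh), pvariance(neigh)

def leaf_soft_predict(leaf_targets):
    """Return (mean, variance) of the training y values that fell in one tree leaf."""
    return mean(leaf_targets), pvariance(leaf_targets)

xs = [[0.0], [1.0], [2.0], [3.0], [10.0]]   # toy training inputs
ys = [1.0, 2.0, 3.0, 4.0, 50.0]             # toy training outputs
mu, var = knn_soft_predict(xs, ys, [1.4], k=3)
```

Both helpers return the two parameters of the soft model at once, which is what makes these base techniques usable without any extra post-processing step.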

We will use these three base techniques throughout the rest of the paper. Table 2 shows the result for the 
three methods above using the three evaluation metrics (mrse, msll, and msvr). 

Interestingly, we can see that for some datasets one method is better than the rest for the conditional 
mean (evaluated by mrse), while it can be the worst for the conditional variance (evaluated by msvr). In 
general, we see that LR gives the worst estimations for the conditional mean (mrse) and variance (msvr), 
which then implies bad results for the density estimation (msll). 

3.2 NCDE from conditional density, variance, reliability or confidence estimators 

The procedure seen above uses the variance derived from the regression technique itself. These
techniques are crisp, and are not really designed to obtain good conditional variances or densities. Instead,
'soft' regression techniques (conditional density estimation, conditional variance estimation, reliability
estimation and confidence estimation using conformal prediction) look more appropriate for deriving a normal
conditional density estimator (NCDE) model. Using these methods for NCDE generally means
that we attempt a related, but different (and sometimes more complex) problem first, and then use some
transformation or derivation from the soft model to the NCDE model. There is nothing against this, provided
the results are good and the procedure does not become extremely difficult or inefficient. Below we
see why these two criteria are not met. A more detailed exploration is given in appendices C, D, E and
F, which also link to the literature.

            LR                     kNN                    Tree
  #     mrse  msll  msvr      mrse  msll  msvr      mrse  msll  msvr
  1     0.28  0.81  0.65      0.35  0.80  0.50      0.41  0.85  0.65
  2     0.52  0.80  0.63      0.39  0.75  0.45      0.19  0.66  0.52
  3     0.02  0.48  0.65      0.16  0.61  0.62      0.21  0.62  0.50
  4     0.12  0.69  0.51      0.32  0.81  0.52      0.25  0.80  0.60
  5     0.11  0.84  0.85      0.11  0.63  0.51      0.11  0.63  0.51
  6     0.55  0.76  0.68      0.20  0.65  0.57      0.16  0.63  0.59
  7     0.46  0.75  0.53      0.45  0.76  0.61      0.50  0.76  0.54
  8     0.14  0.76  0.75      0.13  0.61  0.53      0.12  0.58  0.53
  9     0.07  1.00  0.97      0.29  0.99  0.91      0.26  0.99  0.94
 10     0.31  0.76  0.61      0.31  0.59  0.45      0.23  0.55  0.48
 11     0.19  0.77  0.63      0.15  0.74  0.63      0.16  0.68  0.65
 12     0.89  0.88  0.73      0.47  0.77  0.49      0.45  0.71  0.50
 13     ?     ?     ?         ?     ?     ?         ?     ?     ?
 14     ?     ?     ?         ?     ?     ?         ?     ?     ?
 15     0.50  0.75  0.73      0.55  0.73  0.59      0.56  0.72  0.57
 16     0.41  0.78  0.62      0.26  0.72  0.56      0.22  0.69  0.59
 17     0.97  0.99  0.91      0.39  0.78  0.54      0.37  0.76  0.61
 18     0.30  0.76  0.70      0.34  0.75  0.57      0.35  0.74  0.54
 19     0.33  0.72  0.64      0.34  0.70  0.52      0.48  0.69  0.55
 20     0.58  0.82  0.70      0.20  0.68  0.52      0.26  0.69  0.58
 Mean   0.36  0.78  0.70      0.31  0.73  0.56      0.30  0.71  0.58

Table 2: Three regression techniques using their own conditional variance estimation methods. Results use
the datasets in Table 11, using the experimental methodology in section 2.5 and the metrics in appendix B.



Let us first review the most general approach, the direct estimation of a conditional density function
$\hat f(y|x)$. Most conditional density estimation methods are designed to issue a complete model of the distribution,
which is usually non-parametric. Appendix C describes this approach and shows how it can be adapted
to get a normal conditional density. It also includes some experimental results which show that there is
no improvement over the base techniques using their own conditional variance estimation methods. Also,
general conditional density estimation methods are very inefficient and cannot be used as a post-processing step for
a crisp regression technique.

A second approach, conditional variance estimation (CVE), is much closer to our specific goal, and can 
be used to complement an existing crisp regression model by deriving a second parameter, the conditional 
variance, in order to make up a soft regression model. In fact, these methods can be understood as a post- 
processing step, which is applied to the whole training set, constructing a model of the residuals. Conditional 
variance estimation methods are explored in appendix D, but, again, results do not portray a clear advantage. 

We have also explored some other methods based on reliability or confidence. Appendix E explores one
of the best reliability estimation methods, CNK, included in a recent survey by Bosnic & Kononenko [8]
about reliability measures in regression. It is also meant to output an estimation of the standard deviation.
However, the results are poor. Nonetheless, in appendix E we introduce a 'correction', known as KNC,
which is just based on comparing the estimated mean with the closest k true values in the training (or
validation) set. The results for KNC are better, which has spurred us to introduce a univariate version, uKNC,
that we will see below as an enrichment method. Finally, we explored conformal prediction in appendix F,
which outputs confidence intervals, but the results were not better than the rest.

We will evaluate a selection of these methods at the end of this section. 

3.3 NCDE through enrichment methods 

One of the problems of the previous methods is that they depend on the whole training set for estimating the
conditional variance. This looks natural, since in order to get $\hat\sigma(x)$, we are supposed to need x. However,
if we have a regression model, we already have $\hat y$, which actually carries information about the input value
x. The basic idea of an 'enrichment' method is to derive $\hat\sigma(x)$ from $\hat y$ instead of deriving it from x. With
the original $\hat y$ and the newly derived $\hat\sigma(x)$ we just have an NCDE model. Figure 3 shows this process of
converting a crisp model into an enriched soft model. This univariate derivation can be performed in several
different ways.

Figure 3: Enrichment methods convert a crisp regression model into a soft regression model by just comparing
$\hat y$ with y. The mean of the resulting conditional density function $\hat f(y|x)$ is not altered (the original $\hat y$
is kept). Only the second parameter of a normal distribution (the conditional variance) is added.

A first option is to estimate the residual $u = y - \hat y$ and derive $\hat\sigma(x)$ from it. This procedure, which does
a univariate regression on the residuals given the outputs, is called residual-based enrichment, RBE. We
detail the RBE procedure below:

Definition 3. Given an existing regression model ($m_y$), a training or validation set T, and a (test) instance
x, the residual-based enrichment method (RBE) is defined as follows:

1. Obtain $\hat y_i = m_y(x_i)$ for each example $(x_i, y_i) \in T$.

2. Calculate the residuals: $u_i \leftarrow (y_i - \hat y_i)$.

3. Apply a transformation function $\theta$ to the residuals: $v_i \leftarrow \theta(u_i)$.

4. Train a regression model $m_v$ for the dataset $V = \{(\hat y_i, v_i)\}$.

5. Obtain $\hat y = m_y(x)$ and $\hat v = m_v(\hat y)$ for the example x to be predicted (in the test set).

So, for each example x in the test set, the estimated conditional mean for that example is $\hat\mu(x) = \hat y$ and the
estimated conditional standard deviation is $\hat\sigma(x) = \theta^{-1}(\hat v)$. Note that steps 1 to 4 can be omitted if we just
train and keep $m_v$.
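The five steps of definition 3 can be sketched as follows. This is a minimal illustration, assuming $\theta(t) = t^2$ (so $\theta^{-1}$ is the square root) and a simple univariate kNN as the residual model $m_v$; the names `rbe_fit`, `rbe_predict`, `knn_1d`, the stand-in crisp model and the toy data are all hypothetical.

```python
import math

def knn_1d(pairs, q, k):
    """Univariate kNN regression: mean target of the k pairs whose key is closest to q."""
    nearest = sorted(pairs, key=lambda p: abs(p[0] - q))[:k]
    return sum(v for _, v in nearest) / len(nearest)

def rbe_fit(model, data, theta=lambda u: u * u):
    """Steps 1-4: build V = {(y_hat_i, theta(y_i - y_hat_i))} from (x_i, y_i) pairs."""
    return [(model(x), theta(y - model(x))) for x, y in data]

def rbe_predict(model, v_set, x, k=3, theta_inv=math.sqrt):
    """Step 5: return (mu_hat, sigma_hat) for a new instance x."""
    y_hat = model(x)
    v_hat = knn_1d(v_set, y_hat, k)  # residual model m_v, here a 1-D kNN
    return y_hat, theta_inv(v_hat)

crisp = lambda x: 2.0 * x  # stand-in for an already trained crisp model m_y
train = [(1.0, 2.5), (2.0, 3.5), (3.0, 6.5), (4.0, 9.0), (5.0, 9.0)]
V = rbe_fit(crisp, train)
mu_hat, sigma_hat = rbe_predict(crisp, V, 2.4)
```

Note that the crisp model is only queried, never retrained: the residual model sees the predictions and the transformed residuals, not the original inputs.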

The procedure is similar to the conditional variance estimation methods shown in appendix D, but we remove
the dependency on x for the residual model. This procedure also resembles some calibration methods
in classification. Platt's method [53] applies a univariate function (a sigmoid) to the outputs, in order to
calibrate them. Finally, it also slightly resembles the idea of mimetic models [30, 21].

In order to apply the RBE method, we only need to choose an appropriate transformation function $\theta$
for step 3 and a regression technique for step 4. The transformation function $\theta$ can be chosen at convenience
to ensure that $\hat\sigma(x)$ is always positive or to make an estimation of absolute or squared residuals. Several
possibilities exist, but a natural choice is $\theta(t) = t^2$, if seen as a variance estimation method [69, 67].

 Base   LR           LR           kNN          kNN          Tree         Tree
 ENR    kNN          Tree         kNN          Tree         kNN          Tree
  #     msll  msvr   msll  msvr   msll  msvr   msll  msvr   msll  msvr   msll  msvr
  1     0.78  0.49   0.78  0.51   0.80  0.48   0.80  0.50   0.84  0.62   0.85  0.68
  2     0.78  0.55   0.78  0.54   0.77  0.54   0.76  0.54   0.69  0.55   0.66  0.52
  3     0.53  0.69   0.55  0.71   0.61  0.63   0.62  0.63   0.69  0.63   0.60  0.46
  4     0.71  0.57   0.72  0.58   0.83  0.60   0.81  0.47   0.77  0.50   0.81  0.61
  5     0.63  0.54   0.63  0.53   0.63  0.52   0.64  0.53   0.70  0.64   0.63  0.51
  6     0.71  0.62   0.70  0.61   0.65  0.57   0.66  0.59   0.68  0.67   0.62  0.57
  7     0.76  0.68   0.76  0.63   0.77  0.62   0.76  0.60   0.76  0.57   0.76  0.56
  8     0.64  0.53   0.64  0.53   0.62  0.54   0.61  0.53   0.64  0.62   0.58  0.54
  9     0.95  0.91   0.94  0.90   1.00  0.94   0.99  0.90   1.00  0.96   1.00  0.95
 10     0.69  0.39   0.70  0.43   0.61  0.52   0.67  0.65   0.60  0.59   0.57  0.56
 11     0.72  0.51   0.74  0.57   0.76  0.68   0.76  0.69   0.77  0.76   0.68  0.65
 12     0.86  0.56   0.86  0.57   0.76  0.49   0.76  0.53   0.83  0.74   0.72  0.56
 13     ?     ?      ?     ?      ?     ?      ?     ?      ?     ?      ?     ?
 14     ?     ?      ?     ?      ?     ?      ?     ?      ?     ?      ?     ?
 15     0.75  0.64   0.74  0.61   0.75  0.61   0.73  0.57   0.82  0.71   0.72  0.56
 16     0.76  0.55   0.75  0.52   0.71  0.54   0.72  0.56   0.68  0.57   0.68  0.58
 17     0.98  0.88   0.97  0.87   0.80  0.59   0.83  0.73   0.77  0.67   0.76  0.63
 18     0.74  0.54   0.74  0.56   0.75  0.55   0.75  0.56   0.76  0.60   0.74  0.54
 19     0.69  0.52   0.69  0.55   0.70  0.51   0.70  0.51   0.71  0.58   0.69  0.55
 20     0.79  0.59   0.79  0.61   0.68  0.53   0.69  0.55   0.72  0.62   0.69  0.58
 Mean   0.74  0.59   0.74  0.60   0.74  0.58   0.74  0.59   0.75  0.65   0.72  0.59

Table 3: Results (using the datasets in Table 11) for several base techniques (LR, kNN and Tree) with the
residual-based enrichment (RBE) methods using kNN and Tree as models for the residuals. All the methods
use $\theta(t) = t^2$. Results for mrse are not shown since they are equal to Table 2.



We explore the RBE method for the base techniques (LR, kNN and Tree) and two methods for modelling
the residuals (kNN and Tree). Table 3 shows the results. We see that the results are now similar for all the
base techniques, in contrast to the results in Table 2.

Given that enrichment only requires a univariate regression technique, we can look for simpler and
equally effective approaches, without the need for a second regression technique, such as kNN or
Tree. A simple approach is binning, which just uses a sliding window over the estimated value $\hat y$. This
approach resembles binning calibration in classification [3]. More formally, the BIN method is defined as
follows:

Definition 4. Given an existing regression model ($m_y$), a training or validation set T, and a (test) instance
x, the enrichment method BIN is defined as follows:

1. Obtain $\hat y_i = m_y(x_i)$ for each example $(x_i, y_i) \in T$.

2. Calculate the residuals: $u_i \leftarrow (y_i - \hat y_i)$.

3. Apply a transformation function $\theta$ to the residuals: $v_i \leftarrow \theta(u_i)$.

4. Construct a dataset $V = \{(\hat y_i, v_i)\}$.

5. Sort V by $\hat y_i$.

6. Obtain $\hat y = m_y(x)$ for the example x to be predicted (in the test set).

7. Construct the set W with the k/2 values $v_i$ in V immediately above $\hat y$ and the k/2 values $v_i$ in V
immediately below².

8. Obtain $\hat v$ as the mean of W.

The estimated conditional mean is $\hat\mu(x) = \hat y$ and the estimated conditional standard deviation is $\hat\sigma(x) =
\theta^{-1}(\hat v)$. Note that steps 1 to 4 can be omitted (and the training set is no longer necessary) if we just keep
the dataset V when training the model.

² If there are not sufficient elements above or below we take as many as we can.
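A minimal sketch of the BIN method in definition 4 follows, assuming $\theta(t) = t^2$. The names `bin_fit` and `bin_predict`, the stand-in crisp model and the toy data are hypothetical; treating stored values equal to the prediction as lying "above" it is an implementation assumption.

```python
import bisect
import math

def bin_fit(model, data, theta=lambda u: u * u):
    """Steps 1-5: V = [(y_hat_i, theta(y_i - y_hat_i))], sorted by y_hat_i."""
    return sorted((model(x), theta(y - model(x))) for x, y in data)

def bin_predict(model, v_sorted, x, k=4, theta_inv=math.sqrt):
    """Steps 6-8: average the k/2 transformed residuals on each side of y_hat."""
    y_hat = model(x)
    keys = [p[0] for p in v_sorted]
    pos = bisect.bisect_left(keys, y_hat)  # first stored estimate >= y_hat
    half = k // 2
    # Near the ends of V, take as many values as are available
    window = v_sorted[max(0, pos - half):pos + half]
    v_hat = sum(v for _, v in window) / len(window)
    return y_hat, theta_inv(v_hat)

crisp = lambda x: 2.0 * x  # stand-in for an already trained crisp model m_y
train = [(1.0, 2.5), (2.0, 3.5), (3.0, 6.5), (4.0, 9.0), (5.0, 9.0)]
V = bin_fit(crisp, train)
mu_hat, sigma_hat = bin_predict(crisp, V, 3.5)
```

Because V is kept sorted, each prediction only needs a binary search and a short average, which is why no second regression technique is required.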

A third enrichment method can be defined by constructing the bins using distances and then averaging
the deviation of the true values against the prediction (instead of averaging the residuals). This method is a
univariate version of the method KNC (see appendix E):

Definition 5. Given an existing regression model ($m_y$), a training or validation set T, and a (test) instance x,
the univariate k-nearest comparison enrichment method uKNC is defined as follows:

1. Obtain $\hat y_i = m_y(x_i)$ for each example $(x_i, y_i) \in T$.

2. Construct a dataset $Q = \{(\hat y_i, y_i)\}$.

3. Obtain $\hat y = m_y(x)$ for the example x to be predicted (in the test set).

4. Let $S = \{(\hat y_j, y_j)\}$ be the set of the k nearest neighbours in Q (using the distance $|\hat y_j - \hat y|$ between each $\hat y_j$ in
Q and the fixed $\hat y$).

5. Obtain $s^2 = \frac{1}{k} \sum_{(\hat y_j, y_j) \in S} (y_j - \hat y)^2$.

The estimated conditional mean is $\hat\mu(x) = \hat y$ and the estimated conditional variance is $\hat\sigma(x)^2 = s^2$.

Note that this method is different from RBE using kNN. The method uKNC just looks for the closest
estimations in the training set to the estimation for example x and compares their true values with the
estimation for x. Note that the rationale behind this method is that we link the variance to the estimations,
i.e., given a set of k examples with similar estimations, we calculate how far (on average) the true values are
from the centre estimation. By using the centre estimation and not the estimation for each of the k examples
with most similar estimations, this method can be more robust (since an outlier estimation for one of the k
estimations has no effect on the result).
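The uKNC method of definition 5 can be sketched in the same style. The names `uknc_fit` and `uknc_predict`, the stand-in crisp model and the toy data are hypothetical, and the average in step 5 is taken over the neighbours actually found.

```python
import math

def uknc_fit(model, data):
    """Steps 1-2: Q = [(y_hat_i, y_i)] for each training pair (x_i, y_i)."""
    return [(model(x), y) for x, y in data]

def uknc_predict(model, q_set, x, k=3):
    """Steps 3-5: compare the true values of the k nearest estimates with y_hat."""
    y_hat = model(x)
    nearest = sorted(q_set, key=lambda p: abs(p[0] - y_hat))[:k]
    # Deviations are measured against the centre estimate y_hat,
    # not against each neighbour's own estimate
    s2 = sum((y - y_hat) ** 2 for _, y in nearest) / len(nearest)
    return y_hat, math.sqrt(s2)

crisp = lambda x: 2.0 * x  # stand-in for an already trained crisp model m_y
train = [(1.0, 2.5), (2.0, 3.5), (3.0, 6.5), (4.0, 9.0), (5.0, 9.0)]
Q = uknc_fit(crisp, train)
mu_hat, sigma_hat = uknc_predict(crisp, Q, 2.4)
```

The only difference from a kNN model of the residuals is the reference point of the deviations, which is the estimate for the new instance rather than each neighbour's own estimate.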

Again, we apply these two methods (BIN and uKNC) to the base techniques (LR, kNN and Tree). Table
4 shows the results. As we see, the performance is not degraded at all by these extremely straightforward
and efficient methods. On the contrary, their results are good, especially for uKNC.

3.4 Choosing some appropriate NCDE methods for cost-sensitive applications 

As we have seen, the number of possible methods which can be used to derive a simple (i.e. normal) conditional
density function is really large, and some of them could be parameterised and refined. Nonetheless,
our goal was to select a small set of simple NCDE methods that could produce a reasonably good normal
conditional density estimation or, more precisely, a good pair of conditional mean and conditional variance
(from which a normal conditional density estimation is built).

In order to make a selection, we have analysed some of the methods seen so far in order to find a small 
subset of methods with the following criteria: good performance (for any base regression technique), low 
dependence on the training set and efficiency. Performance results and significance tests are shown in tables 
17, 18 and 19 in appendix G. These tables also include the results for some of the methods mentioned in 
section 3.2 (which are explained in full detail in appendices C, D, E and F). According to these results and 
the previous criteria, we decide to use the following NCDE methods: 

• Own: uses the variance estimation method of each base regression technique itself (section 3.1).

• uKNC: uses the univariate k-nearest comparison enrichment method (definition 5, section 3.3).

• BIN: uses the residual-based enrichment method using binning (definition 4, section 3.3).

 Base   LR           kNN          Tree         LR           kNN          Tree
 ENR    uKNC         uKNC         uKNC         BIN          BIN          BIN
  #     msll  msvr   msll  msvr   msll  msvr   msll  msvr   msll  msvr   msll  msvr
  1     0.78  0.49   0.80  0.48   0.84  0.64   0.78  0.50   0.80  0.48   0.84  0.63
  2     0.78  0.54   0.77  0.56   0.68  0.56   0.78  0.54   0.77  0.55   0.66  0.52
  3     0.57  0.82   0.63  0.66   0.64  0.52   0.54  0.70   0.64  0.66   0.63  0.52
  4     0.77  0.70   0.84  0.62   0.77  0.49   0.72  0.58   0.81  0.52   0.81  0.64
  5     0.63  0.54   0.63  0.52   0.63  0.53   0.63  0.54   0.63  0.52   0.63  0.52
  6     0.71  0.61   0.65  0.57   0.64  0.61   0.71  0.62   0.66  0.59   0.65  0.62
  7     0.75  0.55   0.77  0.62   0.75  0.49   0.75  0.67   0.77  0.61   0.76  0.58
  8     0.64  0.53   0.61  0.53   0.59  0.54   0.64  0.53   0.61  0.53   0.59  0.55
  9     0.73  0.65   0.99  0.91   0.93  0.78   0.95  0.91   0.99  0.92   1.00  0.95
 10     0.67  0.30   0.60  0.49   0.56  0.49   0.69  0.38   0.60  0.48   0.56  0.47
 11     0.72  0.51   0.74  0.64   0.68  0.64   0.72  0.51   0.75  0.65   0.68  0.64
 12     0.85  0.49   0.76  0.50   0.70  0.50   0.86  0.55   0.75  0.49   0.74  0.60
 13     0.72  0.82   0.90  0.65   0.89  0.63   0.58  0.65   0.89  0.59   0.90  0.66
 14     0.74  0.46   0.71  0.51   0.65  0.51   0.74  0.46   0.71  0.51   0.65  0.53
 15     0.75  0.64   0.75  0.61   0.74  0.60   0.75  0.65   0.75  0.61   0.73  0.57
 16     0.74  0.48   0.71  0.53   0.69  0.58   0.75  0.54   0.71  0.54   0.68  0.57
 17     0.91  0.51   0.80  0.58   0.77  0.65   0.98  0.89   0.80  0.62   0.76  0.65
 18     0.74  0.55   0.75  0.56   0.74  0.55   0.74  0.54   0.75  0.55   0.74  0.53
 19     0.69  0.52   0.70  0.51   0.70  0.58   0.69  0.52   0.70  0.51   0.69  0.55
 20     0.79  0.58   0.68  0.52   0.69  0.59   0.79  0.59   0.68  0.52   0.68  0.57
 Mean   0.73  0.57   0.74  0.58   0.71  0.57   0.74  0.59   0.74  0.57   0.72  0.59

Table 4: Results (using the datasets in Table 11) for several base methods (LR, kNN and Tree) with the
enrichment method uKNC and the enrichment method using binning for the residuals (BIN). Method BIN
uses $\theta(t) = t^2$. Results for mrse are not shown since they are equal to Table 2.

Note that we make a selection for the clarity of exposition. If other NCDE methods (either directly or 
by enrichment) are eventually found to perform better or more efficiently (in general or for a particular 
problem), this will give further support for (and improve) the probabilistic reframing methods that we will 
explore in the following sections. 

4 Bid applications 

As explained in the introduction, most regression problems require the minimisation of a loss function
representing the cost context, rather than an uncontextualised (quadratic) error. Frequently, this loss function is
only known at deployment time, so training is usually performed without this information, as shown in Figure 1.
Given the selection of NCDE methods at the end of the previous section, we are ready to apply reframing
to several kinds of cost-sensitive decision problems, where particular families of loss functions are used.
For instance, in this section we will explore a family of problems which are very common in econometrics,
commerce and retailing applications, where we need to estimate the price (or other quantifiable features)
for an offer or bid in the context of a sale, deal or auction ([59, 18, 47, 68, 36]). One of the most relevant
features of the loss functions in these applications is that they are highly discontinuous, since an offer which
is slightly too expensive changes the loss dramatically: from the maximum attainable benefit to no benefit at all
(the offer is not accepted). This is formalised by the following bid loss function:

Definition 6. The bid loss $\ell^B_\beta$ is a loss function defined as follows:

$$\ell^B_\beta(\hat y, y) = \begin{cases} -\hat y + \beta & \text{if } \hat y < y \\ 0 & \text{otherwise} \end{cases}$$

where $\beta$ represents some kind of base cost. If $\hat y > \beta$ then we have positive profits.

Figure 4: Two loss functions shown as a scatter plot with actual output values on the x-axis and estimated
output values on the y-axis. Costs are shown with contour lines and colours (from benefits to high costs
represented with the scale green-yellow-reddish-white). Left: Bid loss function with $\beta = 3$. Right: Bidneg
loss function with $\beta = 3$.

Figure 4 (left) shows a representation of this loss function with $\beta = 3$. Given the bid loss and an NCDE
model $\hat f(y|x)$, we need to determine the optimal local reframing, which gives the lowest expected loss (minimum
risk). This can be done as follows:

Proposition 2. Given $\ell^B_\beta$,

$$r^*(x, \ell^B_\beta, \hat f) = \arg\min_t \{(\beta - t)(1 - \hat F(t|x))\}$$

Proof. From eq. (3), we have that $r^*(x, \ell^B_\beta, \hat f)$ can be written as follows:

$$\arg\min_t \int_{-\infty}^{\infty} \ell^B_\beta(t, y) \, \hat f(y|x)\,dy
= \arg\min_t \left\{ \int_{-\infty}^{t} 0\,dy + \int_{t}^{\infty} (-t + \beta) \, \hat f(y|x)\,dy \right\}
= \arg\min_t \{(\beta - t)(1 - \hat F(t|x))\}$$

□

The previous equation has no closed form in general (and it does not reduce either for the normal 
distribution). For most distributions (exceptions are fat-tailed distributions, such as the Cauchy distribution), 
the value of the estimated cumulative distribution function F(t\x)) goes to 1 faster than t grows to infinity. 
So this expression is for t — > oo. Hence, in general, it only has a minimum. We can find the minimum 
of the previous function numerically or we can calculate the derivative (t — fi)f(t\x) +F(t\x) — 1 and try to 
find (also numerically) the values which make the expression and see which of them are minima. The first 
option seems the easiest one, especially for a normal distribution. While the loss function is discontinuous, 
the expression in proposition 2 is not, and some efficient numerical methods can be used. 
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Apart from the local reframing using NCDE methods and proposition 2 seen above, we will compare 
with a global reframing which just uses the expected value (a crisp regression model) and adds a shift which 
has been optimised for the training set. The methods are then: 

• None: No reframing. The prediction (conditional estimated mean) is used as it is. 

• Own, uKNC, BIN: Probabilistic (local) reframing using a numerical approximation for the expression
$r^*(x, \ell^B_\beta, f)$ with a normal distribution, as given by proposition 2. The conditional normal density
is obtained by three different methods: as derived by the base technique (Own) and the enrichment
methods uKNC and BIN.

• CoSh: Global reframing using a constant shift $s_0$ for all the predictions: $R^+(x, \ell^B_\beta, f) = \hat{E}(y|x) + s_0$.
In order to calculate a good shift, we look for the best shift for the whole training set. The calculation
of $s_0$ can be done numerically. Since $\ell^B_\beta$ is discontinuous, we cannot use many optimisation methods
and we need to use a Monte Carlo algorithm or a resolution-bounded covering algorithm, assuming
that the solution is inside a (wide) interval.
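The CoSh search can be sketched as a resolution-bounded scan over candidate shifts. The bid loss below follows the form used in the proof of proposition 2 (a loss of $\beta - \hat{y}$ when the bid does not exceed the actual value, and 0 otherwise); this reading of definition 6, the function names and the toy data are assumptions for illustration only.

```python
def bid_loss(y_hat, y, beta):
    """Bid loss, assumed form: beta - y_hat if the bid y_hat does not exceed y, else 0."""
    return beta - y_hat if y_hat <= y else 0.0


def best_constant_shift(preds, ys, beta, lo, hi, steps=400):
    """Scan steps + 1 evenly spaced shifts in [lo, hi]; return (shift, mean training loss)."""
    best_s, best_loss = None, float("inf")
    for i in range(steps + 1):
        s = lo + (hi - lo) * i / steps
        loss = sum(bid_loss(p + s, y, beta) for p, y in zip(preds, ys)) / len(ys)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss


if __name__ == "__main__":
    # Hypothetical crisp predictions and actual outputs of a training set.
    preds = [1.0, 2.0, 3.0]
    ys = [1.5, 2.5, 3.5]
    print(best_constant_shift(preds, ys, beta=0.0, lo=-2.0, hi=2.0))
```

The resolution of the scan bounds the error in $s_0$, which is why a wide interval with a fixed step is enough even though the loss is discontinuous.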

Now we see how the previous methods perform. We will use several values of $\beta$ using the equation
$\beta = (max_y - min_y) \cdot \alpha^2$, where $\alpha$ ranges regularly between 0 and 1, and $max_y$ and $min_y$ are the
maximum and minimum values of the output y for the whole dataset. The equation tries to capture a range of
reasonable cases for this family of problems. The rationale is that high values of $\beta$ imply that benefits can
only be obtained with values of y which get close to $max_y$, while low values of $\beta$ imply that we will almost
always get benefits. This is the reason why we have squared $\alpha$: it makes cases with low $\beta$ more frequent
when $\alpha$ is chosen regularly. With this, we explore different reasonable possibilities for $\beta$.
Figure 5 (left) shows the evolution of this loss for different methods and different values of $\beta$ (which is a
function of $\alpha$) for one dataset as an illustration (the figures may vary significantly for other datasets).
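The schedule of $\beta$ values can be generated as in the short sketch below, assuming ten evenly spaced values of $\alpha$ in [0, 1] as stated in the caption of Table 5. The function name and the sample outputs are hypothetical.

```python
def beta_grid(ys, n=10):
    """beta = (max_y - min_y) * alpha**2 for n evenly spaced alpha values in [0, 1]."""
    span = max(ys) - min(ys)
    alphas = [i / (n - 1) for i in range(n)]
    return [span * a * a for a in alphas]


if __name__ == "__main__":
    # Hypothetical output values of a dataset.
    print(beta_grid([2.0, 4.0, 10.0]))
```

Because of the square, most of the generated values fall in the lower half of the output range, which is the effect described above.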

The overall results (Table 5) show that an appropriate cost-sensitive probabilistic (local) reframing out- 
performs a constant shift method (global reframing). The results are consistent for the three different base 
techniques (LR, kNN and Tree). 

After this first loss function and its results for different methods, we can of course figure out other related 
loss functions. For instance, a common variant of the bid loss is when the decision rule does not make a bid 
if we expect no benefit. This is not a rejection rule, which we will see in section 6, but means that there is 
no offer, no sale and, hence, no profit or loss. In many applications, this is a more realistic loss function, and 
can be defined as follows. 

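Definition 7. The non-losing bid loss $\ell^{Bneg}_\beta$ is a loss function defined as follows:

\[ \ell^{Bneg}_\beta(\hat{y}, y) = \begin{cases} -\hat{y} + \beta & \text{if } (\hat{y} < y) \wedge (\beta < \hat{y}) \\ 0 & \text{otherwise} \end{cases} \]

Figure 4 (right) shows a representation of this loss function with $\beta = 3$. We can get its optimal reframing
as we did for $\ell^B_\beta$:

Proposition 3. Given $\ell^{Bneg}_\beta$,

\[ r^*(x, \ell^{Bneg}_\beta, f) = \arg\min_t \{ (\beta - t)(1 - F(\max(\beta, t)|x)) \} \]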



Note that this method does not require the estimation of a conditional normal distribution (only the expected value is needed), so 
any crisp regression method can be used directly. However, it requires the complete training set (or at least the actual output values 
y) for every new context (loss function). This might not be possible in many applications. It also assumes the same parameters for 
the loss function for the whole dataset. 
Figure 5: Left: comparing the bid loss using different methods for dataset rock with base technique kNN.
Right: comparing the bidneg loss using different methods for dataset menarche with base technique LR. 



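Proof. From eq. (3), we have that $r^*(x, \ell^{Bneg}_\beta, f)$ can be written as follows:

\[ \arg\min_t \int_{-\infty}^{\infty} \ell^{Bneg}_\beta(t, y) f(y|x) \, dy \]

\[ = \arg\min_t \left\{ \int_{-\infty}^{\max(\beta, t)} 0 \, dy + \int_{\max(\beta, t)}^{\infty} (-t + \beta) f(y|x) \, dy \right\} \]

\[ = \arg\min_t \{ (\beta - t)(1 - F(\max(\beta, t)|x)) \}    (4) \]

□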

Figure 5 (right) shows the evolution of this loss for different values of $\beta$ (which is a function of $\alpha$) for
one dataset. The overall results for this variant are shown in Table 6 for several base techniques, with the 
same configuration as Table 5. The results are similar to those in Table 5, although the differences between 
the methods are now smaller since there are many more cases where the loss is 0. Again this shows that the 
use of an appropriate cost-sensitive probabilistic reframing gets better results than a constant shift method 
(global reframing). The results are again consistent for different base techniques. 

While we only show the results for some typical bid functions as for definitions 6 and 7, the same idea 
can be used in applications where there can be more bids (see, e.g., [4]), or when auctions work in a different 
way. In that case, a similar result to propositions 2 and 3 could be obtained by applying the maximisation 
recursively. In the cases where the expressions cannot be simplified into a closed form, as in this case, we 
can use a numerical method. 

As mentioned above, the global reframing method cannot be applied when the loss function parameters
may be different for each example. For instance, the $\beta$ parameter of the bid function may depend on the
instance, as in cases where it represents the production cost and is different for each product. The
probabilistic reframing can still be applied in these cases.
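The instance-dependent case can be sketched as follows: each example carries its own $\beta$, and the expression of proposition 3 is minimised separately for each one. The grid search, the function names and the three example instances are assumptions chosen for illustration.

```python
import math


def normal_cdf(t, mu, sigma):
    """F(t|x) for a normal conditional density with mean mu and deviation sigma."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))


def nonlosing_objective(t, mu, sigma, beta):
    """Expression of proposition 3: (beta - t) * (1 - F(max(beta, t)|x))."""
    return (beta - t) * (1.0 - normal_cdf(max(beta, t), mu, sigma))


def reframe_instance(mu, sigma, beta, width=10.0, grid=4000):
    """Grid approximation of argmin_t for one instance (assumed search interval)."""
    lo = min(mu, beta) - width * sigma
    hi = max(mu, beta) + width * sigma
    candidates = (lo + (hi - lo) * i / grid for i in range(grid + 1))
    return min(candidates, key=lambda t: nonlosing_objective(t, mu, sigma, beta))


if __name__ == "__main__":
    # Hypothetical (mean, deviation, production cost beta) triples, one per instance.
    instances = [(10.0, 2.0, 6.0), (10.0, 2.0, 9.5), (4.0, 1.0, 6.0)]
    for mu, sigma, beta in instances:
        print(mu, sigma, beta, reframe_instance(mu, sigma, beta))
```

No training set is revisited here: only the estimated mean and deviation of each instance and its own $\beta$ are needed, which is what a global shift cannot offer.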
  #     LR     LR     LR     LR     LR    kNN    kNN    kNN    kNN    kNN   Tree   Tree   Tree   Tree   Tree
      None    Own   uKNC    BIN   CoSh   None    Own   uKNC    BIN   CoSh   None    Own   uKNC    BIN   CoSh
  1   2.45  -0.19  -0.75  -0.77  -0.05   3.97  -0.33  -0.37  -0.38   0.88   3.78  -0.39   0.01  -0.30   0.72
  2  10.85   0.26  -0.25  -0.23   0.97   7.25  -0.29  -0.25  -0.34   0.12  13.82   0.09   0.11   0.16   0.87
  3   6.99  -0.42  -0.56  -0.52  -0.39   6.08  -0.64  -0.65  -0.64  -0.40   6.81  -0.52  -0.57  -0.46   0.32
  4   6.64   1.89   0.51   3.68   5.61   9.14   3.52   3.61   3.75   6.74   9.47   3.87   4.17   4.31   6.88
  5  17.93  -0.23  -0.98   0.37  13.17  10.14   2.46   3.26   2.35   8.40   9.93   2.52   1.54   2.61   7.17
  6   6.50  -0.16  -0.42  -0.41   0.06   5.70  -0.37  -0.42  -0.36  -0.20   6.22  -0.24  -0.38  -0.21   0.14
  7   3.42   1.97  -0.20  -0.24   1.96   5.54  -0.27  -0.30  -0.31   1.64   3.42  -0.09   0.13   0.08   2.24
  8   8.25   0.10  -0.32   0.34   7.34  10.90   0.63   0.65   0.64   9.98  10.58   0.65   0.55   2.88  10.07
  9  19.96   0.58  -0.09   0.58  15.45  11.94  -0.23  -0.16  -0.07   0.16   5.82  -0.23  -0.18  -0.14   0.43
 10   3.75   0.01  -0.54  -0.18   2.40   3.58  -0.51  -0.48  -0.48  -0.16   1.87  -0.56  -0.60  -0.51  -0.17
 11   9.03   5.58   3.31   5.58   8.47   9.93   5.58   5.58   5.58   9.30   9.92   5.58   5.58   5.58   9.58
 12   9.90  -0.37  -0.78  -0.79   0.02   9.29  -0.77  -0.75  -0.76  -0.14  10.81  -0.67  -0.67  -0.67   1.13
 13  17.15   0.48  -0.41  -0.41  -0.06  16.55  -0.38  -0.37  -0.38  -0.25  18.88  -0.34  -0.36  -0.37   0.25
 14   6.83   0.95  -0.24  -0.26   0.67   7.56  -0.23  -0.32  -0.32   0.26   8.64   0.00  -0.02   0.10   2.71
 15  13.24  -0.16  -0.10  -0.14   0.42   9.02  -0.30  -0.22  -0.33   0.34   8.44  -0.17  -0.25  -0.13   0.82
 16  15.98   3.32  -0.37  -0.37  -0.15  17.37  -0.41  -0.37  -0.39  -0.27  16.60  -0.37  -0.36  -0.33  -0.21
 17   1.14   0.06  -0.29   0.18   0.91   2.89   0.60   0.68   0.65   1.89   2.80   0.84   0.60   0.93   2.22
 18   2.17   1.48  -0.29  -0.36   1.07   1.95  -0.32  -0.31  -0.34   1.04   1.63   0.33  -0.36   0.58   1.31
 19   6.83  -0.50  -0.37  -0.51   0.51   9.46  -0.51  -0.62  -0.54  -0.07   4.02  -0.51  -0.55  -0.49   0.26
 20   9.07  -0.58  -0.61  -0.56   2.62   8.05  -0.59  -0.53  -0.61   0.19   8.64  -0.50  -0.51  -0.48   0.90
 AR   5.00   2.75   1.45   2.05   3.75   5.00   1.80   2.30   1.90   4.00   5.00   1.75   1.50   2.75   4.00

Table 5: Results for the bid loss $\ell^B_\beta$ for the datasets in Table 12, using the experimental methodology in
section 2.5. Each row aggregates the folds and ten different values for $\beta$ per fold using the formula
$\beta = (max_y - min_y) \cdot \alpha^2$ with $\alpha \in \{0, 0.111, 0.222, \ldots, 1\}$. For visibility all the losses are multiplied by 10.
Each section of five columns shows results for different base techniques (LR, kNN and Tree). The average
ranks (AR) are calculated for these three groups separately. The Friedman statistics for the three sections
are 63.44, 65.12 and 71 respectively, which are greater than the critical value (10.92). This means that the
null hypothesis is rejected (significance level: 0.05) and the methods do not perform equally. Differences
in average ranks higher than the critical difference for the Nemenyi post-hoc test (0.3626) imply that the
difference is significant (in bold).



5 Asymmetric loss applications 

As mentioned in the introduction, many regression problems do not have a symmetric loss. Depending on 
the application, overestimations might be worse than underestimations (or vice versa). The way in which 
this asymmetry is modelled has led to the definition of many asymmetric loss functions, such as Lin-Exp 
(approximately linear on one side and exponential on the other side), Quad-Exp (approximately quadratic 
on one side and exponential on the other side), Lin-Lin (asymmetric linear) and Quad-Quad (asymmetric 
quadratic). We will focus on the latter two since these are more common and can be seen as generalisations 
of absolute error and quadratic error respectively. 

First, we give a definition for the asymmetric absolute error $\ell^A_\alpha$.

Definition 8. The asymmetric absolute error $\ell^A_\alpha$ is a loss function defined as follows:

\[ \ell^A_\alpha(\hat{y}, y) = \begin{cases} \alpha (y - \hat{y}) & \text{if } \hat{y} < y \\ (1 - \alpha)(\hat{y} - y) & \text{otherwise} \end{cases} \]

with $\alpha$ being the cost proportion (or asymmetry) between 0 and 1, with increasing values meaning higher
cost for low predictions (underestimation). In other words, when $\alpha = 0$ we mean that predictions below the
actual value have no cost. When $\alpha = 1$ we mean that predictions above the actual value have no cost. When
$\alpha = 0.5$ we mean that costs above and below are symmetric.

Similarly, we give the definition for the asymmetric squared error: 
  #     LR     LR     LR     LR     LR    kNN    kNN    kNN    kNN    kNN   Tree   Tree   Tree   Tree   Tree
      None    Own   uKNC    BIN   CoSh   None    Own   uKNC    BIN   CoSh   None    Own   uKNC    BIN   CoSh
  1  -0.64  -0.71  -0.75  -0.77  -0.73  -0.26  -0.35  -0.37  -0.38  -0.55  -0.30  -0.39  -0.33  -0.36  -0.56
  2  -0.07  -0.19  -0.30  -0.28  -0.20  -0.06  -0.29  -0.30  -0.34  -0.36  -0.05  -0.25  -0.23  -0.22  -0.12
  3  -0.30  -0.42  -0.56  -0.52  -0.49  -0.30  -0.64  -0.65  -0.64  -0.40  -0.19  -0.52  -0.57  -0.46  -0.49
  4   0.00  -0.06  -0.18  -0.01   0.00   0.00  -0.02  -0.02  -0.02   0.00   0.00  -0.01  -0.01  -0.00   0.00
  5  -0.29  -0.29  -0.98  -0.28  -0.01   0.00  -0.02  -0.02  -0.03   0.00   0.00  -0.02  -0.05  -0.02   0.00
  6  -0.26  -0.35  -0.42  -0.41  -0.20  -0.03  -0.37  -0.42  -0.36  -0.35  -0.20  -0.34  -0.38  -0.30  -0.16
  7  -0.08  -0.11  -0.23  -0.24  -0.31  -0.14  -0.27  -0.30  -0.31  -0.34  -0.11  -0.21  -0.19  -0.19  -0.15
  8  -0.11  -0.10  -0.32  -0.10  -0.07   0.00  -0.01  -0.01  -0.01   0.00   0.00  -0.01  -0.01  -0.00   0.00
  9  -0.10  -0.10  -0.09  -0.10  -0.13  -0.03  -0.23  -0.16  -0.14  -0.19  -0.09  -0.23  -0.18  -0.18  -0.31
 10  -0.22  -0.22  -0.54  -0.24  -0.47  -0.03  -0.51  -0.48  -0.48  -0.54  -0.24  -0.56  -0.60  -0.51  -0.55
 11   0.00   0.00  -0.08   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
 12  -0.59  -0.77  -0.78  -0.79  -0.79  -0.67  -0.77  -0.75  -0.76  -0.69  -0.60  -0.67  -0.67  -0.67  -0.59
 13  -0.11  -0.19  -0.41  -0.41  -0.35  -0.09  -0.38  -0.40  -0.40  -0.33  -0.12  -0.37  -0.38  -0.37  -0.30
 14  -0.01  -0.09  -0.32  -0.30  -0.47  -0.03  -0.34  -0.32  -0.32  -0.55  -0.02  -0.21  -0.21  -0.19  -0.04
 15  -0.19  -0.28  -0.17  -0.26  -0.15  -0.34  -0.30  -0.22  -0.33  -0.27  -0.27  -0.27  -0.25  -0.28  -0.22
 16  -0.29  -0.28  -0.39  -0.39  -0.39  -0.29  -0.41  -0.39  -0.40  -0.34  -0.32  -0.37  -0.36  -0.36  -0.32
 17  -0.20  -0.21  -0.29  -0.20  -0.18  -0.04  -0.05  -0.04  -0.05  -0.08  -0.06  -0.06  -0.07  -0.06  -0.06
 18  -0.24  -0.21  -0.29  -0.36  -0.52  -0.32  -0.32  -0.31  -0.34  -0.47  -0.41  -0.35  -0.36  -0.35  -0.37
 19  -0.25  -0.50  -0.37  -0.51  -0.25  -0.15  -0.51  -0.62  -0.54  -0.56  -0.36  -0.51  -0.55  -0.49  -0.48
 20  -0.39  -0.58  -0.61  -0.56  -0.36  -0.31  -0.59  -0.53  -0.61  -0.44  -0.35  -0.50  -0.51  -0.48  -0.28
 AR   3.90   3.40   1.85   2.50   3.35   4.30   2.50   2.90   2.35   2.95   4.10   2.10   1.90   3.00   3.90

Table 6: Results for the bidneg loss for the datasets in Table 12, using the experimental methodology
in section 2.5. Each row aggregates the folds and ten different values for $\beta$ per fold using the formula
$\beta = (max_y - min_y) \cdot \alpha^2$ with $\alpha \in \{0, 0.111, 0.222, \ldots, 1\}$. For visibility all the losses are multiplied by 10.
Each section of five columns shows results for different base techniques (LR, kNN and Tree). The average
ranks (AR) are calculated for these three groups separately. The Friedman statistics for the three sections
are 21.32, 19 and 32.32 respectively, which are greater than the critical value (10.92). This means that the
null hypothesis is rejected (significance level: 0.05) and the methods do not perform equally. Differences
in average ranks higher than the critical difference for the Nemenyi post-hoc test (0.3626) imply that the
difference is significant (in bold).



Definition 9. The asymmetric squared error $\ell^S_\alpha$ is a loss function defined as follows:

\[ \ell^S_\alpha(\hat{y}, y) = \begin{cases} \alpha (y - \hat{y})^2 & \text{if } \hat{y} < y \\ (1 - \alpha)(y - \hat{y})^2 & \text{otherwise} \end{cases} \]

Figure 6 (left and right) shows a representation of these two loss functions for $\alpha = 0.8$.

Now, we look for the optimal choice in both cases. The case for $\ell^A_\alpha$ is relatively straightforward:

Proposition 4. If $\ell$ is the asymmetric absolute error function $\ell^A_\alpha(\hat{y}, y)$ given by definition 8 and $f(y|x)$ is
any conditional distribution (whose mean is denoted by $\mu(x)$), then the expected loss for a predicted value t
is given by:

\[ \mathcal{L}(x, t, f, \ell^A_\alpha) = \alpha \mu(x) + t F(t|x) - \alpha t - \int_{-\infty}^{t} y f(y|x) \, dy \]

where F is the cumulative distribution for f.

From now on, we omit all the proofs, which can be found in appendix H. 
The previous expression can be used to obtain the optimal prediction easily: 
Figure 6: Two loss functions shown as a scatter plot with actual output values on the x-axis and estimated
output values on the y-axis. Costs are shown with contour lines and colours (from benefits to high costs
represented with the scale green-yellow-reddish-white). Left: Asymmetric absolute (Lin-Lin) loss function
(ABS) with $\alpha = 0.8$. Right: Asymmetric squared (Quad-Quad) loss function (SQU) with $\alpha = 0.8$.

Proposition 5. If $\ell$ is the asymmetric absolute error function $\ell^A_\alpha(\hat{y}, y)$ given by definition 8 and $f(y|x)$ is
any conditional distribution, then the optimal prediction is given by the value t such that the following
equality holds:

\[ F(t|x) = \alpha    (5) \]

where F is the cumulative distribution for f.

Clearly, the previous result can be instantiated for any distribution whose cumulative distribution is
invertible, to get:

\[ F^{-1}(\alpha|x)    (6) \]

where $F^{-1}$ is the inverse of the cumulative distribution for f. If f is a normal distribution then we can just
use the quantile function (or probit function).

The expression in eq. 6 is easy and intuitive. For a normal distribution, if we have $\alpha = 0$, predictions
below the actual value have no cost, so the best thing to do is to predict $-\infty$, since the quantile function
returns this for p = 0. When $\alpha = 1$, predictions above the actual value have no cost, so the best thing to do is
to predict $\infty$. If $\alpha = 0.5$ the best result is given by the result of the quantile function for 0.5, i.e., the median,
which for a normal distribution is also the mean.
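For a normal conditional density, eq. 6 amounts to a single call to the quantile function. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name and the explicit handling of the limiting cases $\alpha = 0$ and $\alpha = 1$ are choices made for this example.

```python
import math
from statistics import NormalDist


def reframe_asymmetric_absolute(mu, sigma, alpha):
    """Eq. (6) for a normal conditional density: t = F^{-1}(alpha | x)."""
    if alpha <= 0.0:
        return -math.inf  # underestimation is free, so predict as low as possible
    if alpha >= 1.0:
        return math.inf   # overestimation is free, so predict as high as possible
    return NormalDist(mu, sigma).inv_cdf(alpha)


if __name__ == "__main__":
    # Hypothetical soft prediction: mean 5, standard deviation 2.
    for alpha in (0.2, 0.5, 0.8):
        print(alpha, reframe_asymmetric_absolute(5.0, 2.0, alpha))
```

With $\alpha = 0.5$ the function returns the mean, and larger values of $\alpha$ move the prediction upwards in proportion to the estimated standard deviation of the instance.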

As in the previous section, we compare the method without reframing (None), with probabilistic (local) 
reframing and two methods with global reframing. 

• None: No reframing. The prediction (conditional estimated mean) is used as it is. 

• Own, uKNC, BIN: Probabilistic (local) reframing using $r^*(x, \ell^A_\alpha, f)$, as given by eq. 6. The conditional
normal density is obtained by three different methods: as derived from the base technique (Own) and
enrichment methods (uKNC and BIN).

• CoSh: Global reframing using a constant shift $s_0$ for all the predictions: $R^+(x, \ell^A_\alpha, f) = \hat{E}(y|x) + s_0$.
In order to calculate a good shift, we look for the best shift for the whole training set. Interestingly, in
this case, the calculation of the optimal constant shift for the training data follows a convex function
(from a convex loss function) and can be calculated using efficient numerical methods. For instance,
[1] use hill climbing to calculate this optimal $s_0$.

• PoSh: Global reframing using a polynomial shift s(x) for all the predictions: $R^p(x, \ell^A_\alpha, f) = s(\hat{E}(y|x))$,
where s is a polynomial function. Considering that the problem is convex, [72] present a numerical
method (also based on hill climbing) to derive this polynomial in a relatively efficient way. We will
just show the results for a first-order polynomial because this degree produced the best results.

Figure 7: Left: comparing the asymmetric absolute loss using different methods for dataset iris3 with base
technique LR. Right: comparing the asymmetric square loss using different methods for dataset road with
base technique Tree.

Using these methods, the evolution of this loss for different values of $\alpha$ for one dataset is shown in Figure
7 (left). The overall results are shown in Table 7 for several base techniques. Here we see that the global 
reframing methods perform relatively well too, especially for the base technique using trees. 

Now we will derive the minimisation expression for the asymmetric squared error (definition 9): 

Proposition 6. If $\ell$ is the asymmetric squared error function $\ell^S_\alpha(\hat{y}, y)$ given by definition 9 and f is any
distribution with mean $\mu(x)$ and standard deviation $\sigma(x)$, then the expected loss for a predicted value t is
given by:

\[ \mathcal{L}(x, t, f, \ell^S_\alpha) = (1 - 2\alpha) \left[ t^2 F(t|x) - 2t \int_{-\infty}^{t} y f(y|x) \, dy + \int_{-\infty}^{t} y^2 f(y|x) \, dy \right] + \alpha \left[ t^2 - 2t \mu(x) + \mu_2(x) \right]    (7) \]

where $\mu_2(x)$ is the second raw moment of $f(y|x)$.

Proposition 7. If $\ell$ is the asymmetric squared error function $\ell^S_\alpha(\hat{y}, y)$ given by definition 9 and f is any
distribution with mean $\mu(x)$ and standard deviation $\sigma(x)$, then the optimal prediction is given by the value
t such that the following equation holds:

\[ (1 - 2\alpha) \left[ 2t F(t|x) - 2 \int_{-\infty}^{t} y f(y|x) \, dy \right] + 2\alpha t - 2\alpha \mu(x) = 0    (8) \]



  #     LR     LR     LR     LR     LR     LR    kNN    kNN    kNN    kNN    kNN    kNN   Tree   Tree   Tree   Tree   Tree   Tree
      None    Own   uKNC    BIN   CoSh   PoSh   None    Own   uKNC    BIN   CoSh   PoSh   None    Own   uKNC    BIN   CoSh   PoSh
  1   1.40   0.97   0.95   0.94   0.90   0.89   4.23   3.06   3.08   3.08   3.02   2.84   3.32   2.48   2.42   2.49   2.45   2.35
  2   4.47   3.02   2.80   2.87   2.90   2.92   3.31   2.21   2.27   2.27   2.18   2.53   3.67   2.48   2.49   2.54   2.61   2.52
  3   2.55   1.73   1.69   1.66   1.70   1.81   2.86   1.82   1.84   1.83   1.82   1.91   3.25   2.09   2.07   2.12   2.12   2.09
  4   4.18   3.28   3.22   3.20   3.26   3.34   7.87   5.82   5.62   5.63   5.81   5.72   8.38   6.37   6.40   6.29   6.19   6.04
  5   2.62   1.92   1.65   1.90   1.88   1.72   5.53   3.94   3.75   3.88   3.64   3.52   4.44   3.27   3.15   3.30   3.12   3.01
  6   2.18   1.57   1.69   1.58   1.46   1.64   3.13   2.30   2.18   2.28   2.32   2.60   3.78   2.46   2.39   2.44   2.44   2.35
  7   6.01   4.59   4.26   4.27   4.22   3.91   5.20   3.60   3.60   3.59   3.52   3.38   6.06   4.32   4.37   4.35   4.39   4.10
  8   1.72   1.31   1.18   1.33   1.34   1.33   2.89   2.19   2.19   2.19   2.23   2.22   3.30   2.30   2.27   2.33   2.25   2.00
  9   7.38   5.38   4.34   5.78   5.76   6.02   2.85   1.92   1.92   1.96   1.99   2.13   3.96   2.49   2.56   2.55   2.52   2.07
 10   4.89   3.45   2.72   3.59   3.55   3.47   3.54   2.36   2.38   2.39   2.33   2.87   5.55   3.25   3.20   3.45   3.52   3.25
 11   6.58   5.15   4.90   5.08   4.98   4.83   8.20   6.44   6.25   6.30   6.28   6.10   8.20   6.46   6.46   6.47   6.46   6.43
 12   1.27   0.95   0.80   0.81   0.80   0.87   1.28   0.86   0.84   0.84   0.83   0.88   2.00   1.31   1.31   1.30   1.30   1.27
 13   3.84   2.80   2.47   2.47   2.43   2.44   4.00   2.60   2.58   2.58   2.53   2.49   4.20   2.71   2.70   2.70   2.68   2.61
 14   6.00   4.37   3.93   3.98   3.99   3.88   4.85   3.09   3.11   3.11   3.12   3.09   6.49   4.34   4.38   4.38   4.50   4.22
 15   2.28   1.52   1.53   1.48   1.51   1.42   2.16   1.44   1.53   1.48   1.41   1.56   2.77   1.84   1.76   1.82   1.86   1.71
 16   2.47   1.86   1.56   1.57   1.56   1.65   2.27   1.46   1.48   1.46   1.47   1.56   2.33   1.49   1.51   1.51   1.52   1.57
 17  20.18  15.72  11.50  15.93  15.91  14.76   6.03   3.77   3.78   3.75   4.23   3.35   8.48   5.13   5.24   5.54   6.66   5.21
 18   7.27   5.56   5.24   5.20   5.28   4.55   5.45   3.94   3.84   3.78   4.08   3.88   6.15   4.74   4.69   4.74   4.72   4.51
 19   2.74   1.74   1.72   1.72   1.68   1.80   4.21   2.71   2.70   2.74   2.69   2.65   3.73   2.45   2.51   2.53   2.53   2.42
 20   2.97   1.91   1.91   1.95   1.98   2.01   2.89   1.90   1.93   1.91   1.96   2.13   2.94   1.92   1.88   1.90   1.93   1.99
 AR   6.00   3.90   2.35   3.05   2.75   2.95   6.00   2.90   3.00   3.05   2.90   3.15   6.00   3.10   2.75   3.80   3.65   1.70

Table 7: Results for the absolute loss $\ell^A_\alpha$ for the datasets in Table 12, using the experimental methodology
in section 2.5. Each row aggregates the folds and ten different values for $\alpha$ per fold with
$\alpha \in \{0, 0.111, 0.222, \ldots, 1\}$. For visibility all the losses are multiplied by 10. Each section of six columns
shows results for different base techniques (LR, kNN and Tree). The average ranks (AR) are calculated for
these three groups separately. The Friedman statistics for the three sections are 50.29, 43.11 and 59
respectively, which are greater than the critical value (12.57). This means that the null hypothesis is rejected
(significance level: 0.05) and the methods do not perform equally. Differences in average ranks higher
than the critical difference for the Nemenyi post-hoc test (0.5217) imply that the difference is significant (in
bold).



The previous result can be simplified for a normal distribution:

Proposition 8. If $\ell$ is the asymmetric squared error function $\ell^S_\alpha(\hat{y}, y)$ given by definition 9 and f is a normal
distribution with mean $\mu(x)$ and standard deviation $\sigma(x)$, then the optimal prediction t is given by first
calculating t' from the following equation:

\[ t' \Phi(t') + \phi(t') + t' \frac{\alpha}{1 - 2\alpha} = 0    (9) \]

and then getting $t = \sigma(x) t' + \mu(x)$. Note the use of the standardised cumulative normal distribution $\Phi$ and
the standardised normal density function $\phi$.

Even though the value of t' cannot be expressed in a closed form, we only need to calculate this value
for each $\alpha$ once, since it is calculated for the standard normal distribution. Then, we just use the expression
$t = \sigma(x) t' + \mu(x)$ for each example.
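The value of t' can be obtained with a simple root finder. The sketch below applies bisection to eq. (9) multiplied by $(1 - 2\alpha)$, a rearrangement chosen here so that $\alpha = 0.5$ needs no special case; the bracket [-40, 40] and the function names are assumptions, and the rearranged function is increasing in t' for $0 < \alpha < 1$, so the root is unique.

```python
import math


def std_cdf(z):
    """Standardised cumulative normal distribution Phi."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def std_pdf(z):
    """Standardised normal density phi."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def eq9_scaled(tp, alpha):
    """Left-hand side of eq. (9) multiplied by (1 - 2 * alpha)."""
    return (1.0 - 2.0 * alpha) * (tp * std_cdf(tp) + std_pdf(tp)) + alpha * tp


def solve_t_prime(alpha, lo=-40.0, hi=40.0, iters=200):
    """Bisection for the root t' (assumes 0 < alpha < 1 and a root inside [lo, hi])."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if eq9_scaled(mid, alpha) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def reframe_asymmetric_squared(mu, sigma, t_prime):
    """Rescale the standardised solution for one example: t = sigma * t' + mu."""
    return sigma * t_prime + mu


if __name__ == "__main__":
    t_prime = solve_t_prime(0.8)  # computed once per value of alpha
    # Hypothetical soft predictions (mean, standard deviation).
    for mu, sigma in [(10.0, 1.0), (10.0, 3.0)]:
        print(reframe_asymmetric_squared(mu, sigma, t_prime))
```

The root is computed once for each $\alpha$ and then reused for every example, so the cost per instance is one multiplication and one addition.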

Figure 7 (right) shows the evolution of this loss for different values of $\alpha$ for one dataset, using several
reframing methods. The overall results are shown in Table 8 for several base techniques and the same 
enrichment methods as in Table 7. The results are better for the probabilistic (local) reframing methods. 
This indicates that the (more common) asymmetric squared loss, which highly penalises wrong big shifts, 
requires a more detailed (local) reframing. 

Apart from $\ell^A_\alpha$ and $\ell^S_\alpha$, there are many other kinds of asymmetric loss. For instance, we could use
a discrete function where the loss would be 0 if the error is inside a tolerance band (which can be asymmetric)
and, e.g., 1 otherwise. We will discuss this notion of 'tolerance' after the following section.
  #     LR     LR     LR     LR     LR     LR    kNN    kNN    kNN    kNN    kNN    kNN   Tree   Tree   Tree   Tree   Tree   Tree
      None    Own   uKNC    BIN   CoSh   PoSh   None    Own   uKNC    BIN   CoSh   PoSh   None    Own   uKNC    BIN   CoSh   PoSh
  1   0.61   0.43   0.46   0.41   0.41   0.39   5.22   3.77   3.79   3.80   3.67   3.4?   3.11   2.34   2.26   2.34   2.23   2.13
  2   7.25   5.17   4.96   5.02   5.02   5.15   4.92   3.56   3.60   3.58   3.47   4.17   6.87   4.92   4.93   4.99   4.97   4.95
  3   2.67   1.90   1.81   1.80   1.82   1.92   2.87   1.88   1.89   1.89   1.91   2.08   3.62   2.43   2.38   2.45   2.44   2.41
  4   6.20   4.86   4.74   4.84   4.77   4.85  14.19  10.51  10.25  10.27  10.25   9.90  15.82  11.99  12.04  11.85  11.54  11.17
  5   2.23   1.65   1.57   1.66   1.67   1.54   8.35   6.19   6.10   6.14   6.23   6.15   6.64   5.07   4.97   5.09   5.02   5.00
  6   2.76   2.11   2.16   2.07   2.05   2.22   4.37   3.16   3.10   3.17   3.07   3.29   5.39   3.81   3.73   3.75   3.76   3.92
  7   8.40   6.40   5.98   6.00   5.90   5.65   6.94   4.86   4.86   4.84   4.73   4.78   8.65   6.20   6.25   6.23   6.16   5.93
  8   3.61   2.81   2.47   2.87   2.87   2.83   6.08   4.77   4.78   4.78   5.84   6.64   6.98   5.29   5.23   5.27   5.54   6.03
  9 185.34 129.41 110.45 147.83 ?47.87 147.??   ?.33   3.13   3.15   3.22   3.15   4.00   5.93   4.03   4.13   4.06   4.21   4.41
 10   8.50   6.11   5.23   6.37   6.37   6.20   4.78   3.34   3.38   3.39   3.63   7.64   7.45   4.73   4.70   4.87   4.97   4.82
 11   9.96   7.82   7.51   7.71   7.66   7.57  14.04  11.01  10.70  10.78  11.04  10.97  14.06  11.06  11.06  11.06  11.07  10.99
 12   0.52   0.39   0.35   0.35   0.34   0.37   0.60   0.42   0.41   0.41   0.41   0.45   1.22   0.83   0.83   0.82   0.82   0.82
 13   5.01   3.72   3.35   3.35   3.36   3.44   5.45   3.66   3.66   3.66   3.68   3.67   5.75   3.87   3.85   3.86   3.86   3.85
 14  11.20   8.22   7.53   7.64   7.69   7.32   7.16   4.85   4.86   4.86   4.94   5.03  12.35   8.42   8.48   8.47   8.66   8.13
 15   2.72   1.96   1.95   1.93   1.93   2.00   2.51   1.80   1.85   1.83   1.93   2.01   2.93   2.06   1.99   2.06   2.07   2.10
 16   2.60   2.00   1.80   1.81   1.79   1.86   2.35   1.62   1.65   1.64   1.63   1.70   2.46   1.67   1.69   1.69   1.71   1.71
 17 163.98 127.58 ?39.11 129.83 129.50 ?27.??   ?.84   7.43   7.40   7.49   8.71  14.69  20.23  12.95  13.06  13.40  14.44  13.03
 18  13.19  10.02   9.51   9.52   9.68   9.05   7.23   5.28   5.17   5.11   5.42   4.90   9.41   7.26   7.17   7.26   7.12   6.82
 19   2.13   1.41   1.44   1.39   1.37   1.43   5.33   3.51   3.51   3.54   3.50   3.42   4.75   3.19   3.30   3.28   3.36   3.28
 20   2.62   1.77   1.77   1.79   1.79   1.82   2.81   1.90   1.92   1.89   1.93   2.11   2.73   1.83   1.78   1.82   1.84   1.84
 AR   6.00   3.85   2.10   2.95   2.90   3.20   5.85   2.45   2.70   2.80   3.25   3.95   6.00   2.85   2.45   3.45   3.70   2.55

Table 8: Results for the squared loss $\ell^S_\alpha$ for the datasets in Table 12, using the experimental methodology
in section 2.5. Each row aggregates the folds and ten different values for $\alpha$ per fold with
$\alpha \in \{0, 0.111, 0.222, \ldots, 1\}$. For visibility all the losses are multiplied by 10. Each section of six columns
shows results for different base techniques (LR, kNN and Tree). The average ranks (AR) are calculated for
these three groups separately. The Friedman statistics for the three sections are 51.91, 45.83 and 49.83
respectively, which are greater than the critical value (12.57). This means that the null hypothesis is rejected
(significance level: 0.05) and the methods do not perform equally. Differences in average ranks higher
than the critical difference for the Nemenyi post-hoc test (0.5217) imply that the difference is significant (in
bold).



6 Rejection rule applications 

A common situation when working with predictive models appears when there is the possibility of absten- 
tion, i.e., to reject the prediction and do nothing (or delegate to an expert or other kind of model). The 
rationale is to avoid a decision that is likely to have more cost than the abstention itself. In order to do this, 
we need to know what the cost of an abstention is, which may be constant or may depend on the instance. 
According to this information, a decision rule which tries to minimise the cost (as an aggregation of the 
overall prediction loss and the rejection cost) is known as a rejection rule. Several works in the literature 
have been devoted to rejection rules, although most of the work in the area of machine learning is conceived 
for classification [28, 52, 25]. 

We can apply a rejection rule on top of any loss function, such as those seen in the previous sections. For
instance, for the asymmetric absolute error $\ell^A_\alpha$ we can derive the corresponding loss with rejection option
$\ell^A_{\alpha,\rho}$ as follows:

Definition 10. The asymmetric absolute error $\ell^A_{\alpha,\rho}$ with rejection option is a loss function defined as follows:

\[ \ell^A_{\alpha,\rho}(\hat{y}, y) = \begin{cases} \rho & \text{if REJECT} \\ \ell^A_\alpha(\hat{y}, y) & \text{otherwise} \end{cases} \]

A straightforward way of handling this type of loss function with rejection option is to estimate the
expected loss and check whether it is greater than $\rho$. If this is the case we should reject. Otherwise, we
should use the decision rule as if there were no rejection. For instance, for $\ell^A_{\alpha,\rho}$, we would just calculate
the expected loss using proposition 4, and compare this to $\rho$. Only if it is lower than $\rho$ would we apply the
minimisation given by proposition 5.

However, we need to derive a more operative expression for the expected loss from proposition 4: 

Proposition 9. Consider the asymmetric absolute error function $\ell^A_\alpha(\hat{y}, y)$ given by definition 8 and a normal
conditional distribution f with mean $\mu(x)$ and standard deviation $\sigma(x)$. The expected loss for a prediction
value t can be further simplified to:

\[ \mathcal{L}(x, t, f, \ell^A_\alpha) = \left[ t' \Phi(t') + \phi(t') - \alpha t' \right] \sigma(x)    (10) \]

with $t' = \frac{t - \mu(x)}{\sigma(x)}$. Note the use of the standardised cumulative normal distribution $\Phi$ and the standardised
normal density function $\phi$.

The expected loss given by proposition 9 is then easy to calculate and can be compared with the
cost of rejection. This leads to the decision rule for REJECT:

\[ \mathcal{L}(x, t, f, \ell^A_\alpha) > \rho    (11) \]
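The rule can be sketched with the standard library as follows. The expected loss is taken from eq. (10); evaluating it at the optimal prediction of eq. (6) before comparing with $\rho$ is an assumption of this example, as are the function names and the use of `None` to signal REJECT.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal: provides Phi (cdf) and phi (pdf)


def expected_abs_loss(t, mu, sigma, alpha):
    """Eq. (10): [t' Phi(t') + phi(t') - alpha t'] * sigma, with t' = (t - mu) / sigma."""
    tp = (t - mu) / sigma
    return (tp * STD.cdf(tp) + STD.pdf(tp) - alpha * tp) * sigma


def predict_or_reject(mu, sigma, alpha, rho):
    """Return the reframed prediction, or None (REJECT) when eq. (11) holds."""
    t = NormalDist(mu, sigma).inv_cdf(alpha)  # eq. (6); requires 0 < alpha < 1
    if expected_abs_loss(t, mu, sigma, alpha) > rho:
        return None
    return t


if __name__ == "__main__":
    # Same mean, different estimated deviations: only the reliable one is predicted.
    print(predict_or_reject(10.0, 1.0, 0.5, rho=0.5))
    print(predict_or_reject(10.0, 2.0, 0.5, rho=0.5))
```

Since the expected loss scales with $\sigma(x)$, instances with a large estimated deviation are the ones rejected first, which is the ranking effect that global methods lack.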

Now we are ready to compare the methods based on probabilistic (local) reframing using the above rule 
with methods which do a global reframing, as we did in the previous section. The methods are: 

• None: No reframing. The prediction (conditional estimated mean) is used as it is. 

• Own, uKNC, BIN. We use the rejection rule given by eq. (11) and proposition 9. In case of no reject 
we apply proposition 5, eq. (6), as used in the previous section. 

• CoSh: We decide whether we reject or not using the optimal reject rate calculated on the training 
dataset. This rate is calculated assuming that a percentage (rate) of examples is just rejected (the 
examples are necessarily chosen randomly, since there is no information about reliability). Given the 
optimal rate and the optimal shift, we reject examples using the rate and, if the example is finally not 
rejected, we use the CoSh method as in the previous section (using the method from [1]). 

• PoSh: Similar to CoSh but the polynomial approach in [72] is used instead.

For the experiments, we vary both $\alpha$ and $\rho$. For $\rho$ we apply the following function:

where $\sigma(D_y)$ is the standard deviation of the output variable for the dataset D. We let r range between 0 and
1. The rationale for the previous function can be explained with the extreme cases. If r = 0 we get $\rho = 0$
and reject has no cost (so we will always reject). If r = 0.5 we get $\rho = \frac{1}{2}\sigma(D_y)$, which means that with a
trivial constant model, residuals will equal the standard deviation (and the expected error), so the expected
loss will be $\frac{1}{2}\sigma(D_y) = (\alpha \sigma(D_y) + (1 - \alpha)\sigma(D_y))/2$. This means that approximately we will reject half of
the times (for a trivial model). And finally, for r = 1 we get $\rho = \infty$ and reject has infinite cost (so we will
never reject).

Figure 8 (left) shows the evolution of this loss for different values of $\rho$, as derived from r (with $\alpha$ fixed
to 0.5) for one dataset. The overall results are shown in Table 9. We see that probabilistic reframing takes
advantage of a local decision. In the end, the conditional variance is used to make a ranking, which is crucial
for rejection rules.

For the squared loss with reject ($\ell^S_{\alpha,\rho}$), we work similarly:
Figure 8: Left: comparing the absolute loss with reject using different methods for dataset savings with base
technique Tree. Right: comparing the squared loss with reject using different methods for dataset salinity 
with base technique kNN. 



Proposition 10. Consider the asymmetric squared error function $\ell^S_\alpha(\hat{y}, y)$ given by definition 9 and a normal
conditional distribution f with mean $\mu(x)$ and standard deviation $\sigma(x)$. The expected loss can be expressed
as:

^(x,t,f,£ s a ) = <D(f')(l-2a) [(t'a) 2 + 3t'a 2 q(t')-2^ 2 +^aq{t')-a 2 ]+aa 2 (t'+l) (12) 

with t' = ' , q(t') = ^fel and notation ]lfor }X(x) and a for d{x). 

Although the previous expression is long, it can be computed easily with the standard normal distribution.
Figure 8 (right) shows the evolution of this loss for a fixed $\alpha = 0.5$ and different values of $r$ (from
which $\rho$ is derived) for one dataset. The overall results with the same configuration as the absolute loss with
reject are shown in Table 10. The results are even more clear-cut in this case.
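
The kind of expectation handled by Proposition 10 only needs the standard normal density and distribution functions. The following Python sketch does not transcribe equation (12). It evaluates the expectation directly from the truncated second moments of a normal variable, assuming the loss is $\alpha(t-y)^2$ when the prediction $t$ underestimates $y$ and $(1-\alpha)(t-y)^2$ otherwise (definition 9 may use a different scaling). The function name is hypothetical.

```python
from statistics import NormalDist


def expected_asym_squared_loss(t, mu, sigma, alpha):
    """Expected loss of predicting t when y ~ N(mu, sigma^2).

    Assumed loss: alpha * (t - y)^2      if t < y (underestimate),
                  (1 - alpha) * (t - y)^2 otherwise (overestimate).
    """
    std = NormalDist()
    z = (t - mu) / sigma
    # E[(z - Z)^2 ; Z < z] for a standard normal Z (overestimation part).
    over = (z * z + 1.0) * std.cdf(z) + z * std.pdf(z)
    # E[(z - Z)^2] = z^2 + 1, so the underestimation part is the remainder.
    under = (z * z + 1.0) - over
    return sigma ** 2 * (alpha * under + (1.0 - alpha) * over)
```

For $\alpha = 0.5$ the expression reduces to half the usual expected squared error, $\frac{1}{2}((t-\mu)^2 + \sigma^2)$, which is a convenient sanity check.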



7 Discussion 

After this tour of several families of loss functions using different kinds of reframing methods (local
or global), we are ready to analyse the results comprehensively, look at the many other possible applications and
close the paper with the overall contributions and some future work.

7.1 Overview of results and contributions 

As a short recapitulation of results, we can summarise tables 5, 6, 7, 8, 9 and 10 by counting the number
of cases where each reframing is in the group of the (statistically significant) best results. This is 17 (out of
18) for probabilistic local reframing versus 8 (out of 18) for global reframing. If we focus on particular
methods, the probabilistic local reframing based on the enrichment method uKNC is in the group of best
results 15 times (out of 18) versus only 6 (out of 18) for the best global reframing (PoSh, and CoSh for
the first two tables). In fact, if we compare uKNC against PoSh (CoSh for the first two tables) using the
Nemenyi post-hoc test difference in each case, we have 11 wins, 6 ties and 1 loss. This supports the claim

Base technique: LR
 #     None     Own    uKNC     BIN    CoSh    PoSh
 1     1.40    0.68    0.68    0.66    1.10    1.08
 2     4.47    1.89    1.68    1.72    1.89    1.82
 3     2.55    1.14    1.08    1.07    1.33    1.36
 4     4.18    2.23    2.10    2.19    2.19    2.13
 5     2.62    1.32    1.09    1.32    1.49    1.41
 6     2.18    1.06    1.12    1.06    1.13    1.17
 7     6.01    3.09    2.66    2.67    2.45    2.40
 8     1.72    0.89    0.70    0.91    0.92    0.90
 9     7.38    1.54    1.24    3.94    3.87    3.81
10     4.89    2.14    1.49    2.29    2.29    2.31
11     6.58    3.51    3.22    3.41    2.92    2.90
12     1.27    0.66    0.57    0.57    0.99    1.01
13     3.84    1.85    1.52    1.51    1.50    1.48
14     6.00    2.71    2.24    2.29    2.29    2.24
15     2.28    1.01    1.01    0.98    1.36    1.31
16     2.47    1.28    1.02    1.03    1.11    1.13
17    20.18    8.83    3.96   10.43    5.18    5.16
18     7.27    3.60    3.17    3.18    2.63    2.43
19     2.74    1.15    1.14    1.13    1.28    1.30
20     2.97    1.24    1.25    1.27    1.45    1.47
AR     6.00    3.50    1.80    2.75    3.70    3.25

Base technique: kNN
 #     None     Own    uKNC     BIN    CoSh    PoSh
 1     4.23    1.96    1.97    1.96    1.84    1.81
 2     3.31    1.38    1.43    1.43    1.50    1.62
 3     2.86    1.15    1.17    1.16    1.31    1.31
 4     7.87    3.76    3.53    3.55    3.80    3.95
 5     5.53    2.63    2.48    2.58    2.49    2.40
 6     3.13    1.45    1.36    1.45    1.51    1.75
 7     5.20    2.24    2.24    2.22    2.20    2.18
 8     2.89    1.51    1.51    1.51    1.66    1.66
 9     2.85    1.21    1.22    1.27    1.45    1.55
10     3.54    1.47    1.50    1.49    1.62    2.01
11     8.20    4.37    4.08    4.16    3.93    3.88
12     1.28    0.61    0.59    0.59    0.94    0.95
13     4.00    1.59    1.57    1.57    1.54    1.53
14     4.85    1.86    1.88    1.88    1.94    1.92
15     2.16    0.94    1.01    0.97    1.19    1.22
16     2.27    0.95    0.97    0.96    1.06    1.10
17     6.03    2.29    2.29    2.27    2.36    2.00
18     5.45    2.52    2.40    2.36    2.21    2.17
19     4.21    1.64    1.64    1.67    1.71    1.66
20     2.89    1.23    1.25    1.23    1.45    1.51
AR     6.00    2.50    2.85    2.60    3.70    3.35

Base technique: Tree
 #     None     Own    uKNC     BIN    CoSh    PoSh
 1     3.32    1.70    1.61    1.72    1.57    1.55
 2     3.67    1.56    1.55    1.58    1.71    1.71
 3     3.25    1.32    1.29    1.33    1.49    1.47
 4     8.38    4.16    4.17    3.94    3.83    3.90
 5     4.44    2.23    2.10    2.23    2.21    2.15
 6     3.78    1.57    1.48    1.54    1.75    1.61
 7     6.06    2.70    2.78    2.73    2.46    2.40
 8     3.30    1.52    1.51    1.54    1.59    1.46
 9     3.96    1.50    1.54    1.57    1.87    1.56
10     5.55    1.92    1.87    2.05    2.33    2.16
11     8.20    4.39    4.39    4.39    4.04    4.03
12     2.00    0.89    0.89    0.89    1.14    1.14
13     4.20    1.66    1.64    1.64    1.61    1.59
14     6.49    2.51    2.53    2.53    2.44    2.38
15     2.77    1.21    1.14    1.19    1.40    1.33
16     2.33    0.97    0.98    0.99    1.10    1.11
17     8.48    2.83    2.87    3.00    3.05    2.87
18     6.15    3.25    3.17    3.20    2.39    2.36
19     3.73    1.53    1.56    1.59    1.61    1.58
20     2.94    1.24    1.21    1.23    1.41    1.41
AR     6.00    2.85    2.30    3.45    3.70    2.70

Table 9: Results for the absolute reject loss $\ell^{A}_{\alpha,\rho}$ for the datasets in Table 12, using the experimental
methodology in section 2.5. Each row aggregates the folds and five different values for $\alpha$ with
$\alpha \in \{0, 0.25, 0.5, 0.75, 1\}$ and ten different values for $\rho$ with $r \in \{0, 0.111, 0.222, \ldots, 1\}$ and $\rho = \frac{1}{2}\sigma(D_Y)\frac{r}{1-r}$
(totalling 50 variations per fold). For visibility all the losses are multiplied by 10. Each section of six
columns shows results for different base techniques (LR, kNN and Tree). The average ranks (AR) are
calculated for these three groups separately. The Friedman statistics for the three sections are 56.03, 48.83 and
50.26, respectively, which are greater than the critical value (12.57). This means that the null hypothesis
is rejected (significance level: 0.05) and the methods do not perform equally. Differences in average
ranks higher than the critical difference for the Nemenyi post-hoc test (0.5217) imply that the difference is
significant (in bold).



that an appropriate probabilistic reframing with the use of a lightweight normal conditional distribution using
appropriate estimations (directly or through enrichment methods) is a good approach for a wide variety of
cost-sensitive problems.

This claim is accompanied by a series of major contributions: 

• We push forward an appropriate mapping between classification and regression, and develop the right 
parallelism between crisp and soft models in classification and regression. 

• We vindicate reframing as a flexible and powerful approach for context-sensitive applications, and
characterise the distinction between global reframing (typically based on crisp regression models)
and local reframing (typically based on soft regression models).

• We uphold a lightweight view of soft regression models as normal conditional density estimators, only 
requiring two parameters. This entails several benefits: easier, more robust estimation and simpler 
optimisation formulae resulting from expected loss. 

• We introduce new metrics for the evaluation of soft regression models. 

• We present enrichment as a way to convert any traditional one-parameter crisp regression model into 
a two-parameter soft regression model, by just working with the actual and predicted values. This 

Base technique: LR
 #     None     Own    uKNC     BIN    CoSh    PoSh
 1     0.61    0.31    0.40    0.30    1.33    1.33
 2     7.25    3.26    2.90    3.01    3.02    3.16
 3     2.67    1.32    1.14    1.12    1.45    1.49
 4     6.20    3.54    3.45    3.36    3.85    3.81
 5     2.23    1.15    1.12    1.15    1.74    1.66
 6     2.76    1.56    1.62    1.47    1.55    1.64
 7     8.40    4.41    4.08    4.11    3.74    3.65
 8     3.61    1.92    0.93    1.94    2.12    2.10
 9   185.34   11.03   10.11   99.93   86.15   86.25
10     8.50    3.75    2.02    4.42    3.71    3.63
11     9.96    5.32    5.21    5.28    4.94    4.91
12     0.52    0.27    0.26    0.25    1.33    1.33
13     5.01    2.58    1.97    1.93    1.96    1.97
14    11.20    5.31    3.83    4.14    3.92    3.86
15     2.72    1.35    1.29    1.32    1.81    1.82
16     2.60    1.37    1.14    1.17    1.38    1.39
17   163.98   76.23   12.29   88.02   21.47   21.23
18    13.19    6.86    6.07    6.28    4.38    4.30
19     2.13    0.98    1.04    0.96    1.45    1.44
20     2.62    1.22    1.23    1.25    1.62    1.63
AR     5.80    3.35    1.95    2.65    3.60    3.65

Base technique: kNN
 #     None     Own    uKNC     BIN    CoSh    PoSh
 1     5.22    2.55    2.64    2.65    2.27    2.09
 2     4.92    2.23    2.41    2.32    2.48    2.69
 3     2.87    1.14    1.18    1.16    1.54    1.58
 4    14.19    7.03    6.47    6.52    6.74    6.61
 5     8.35    4.19    4.06    4.16    4.33    4.25
 6     4.37    1.96    1.88    2.07    2.05    2.19
 7     6.94    3.24    3.21    3.19    3.28    3.28
 8     6.08    3.25    3.25    3.25    4.01    4.43
 9     4.33    2.03    2.09    2.17    2.21    2.43
10     4.78    2.10    2.20    2.17    2.72    3.89
11    14.04    7.48    7.28    7.32    7.22    7.23
12     0.60    0.31    0.30    0.30    1.20    1.21
13     5.45    2.10    2.14    2.13    2.07    2.07
14     7.16    2.93    2.93    2.93    3.06    3.14
15     2.51    1.20    1.29    1.25    1.44    1.49
16     2.35    0.93    1.06    1.03    1.36    1.38
17    10.84    4.21    4.16    4.27    4.55    6.00
18     7.23    3.69    3.49    3.41    3.05    2.94
19     5.33    2.08    2.11    2.12    2.19    2.15
20     2.81    1.28    1.32    1.27    1.58    1.61
AR     5.90    2.30    2.60    2.45    3.70    4.05

Base technique: Tree
 #     None     Own    uKNC     BIN    CoSh    PoSh
 1     3.11    1.62    1.59    1.62    1.5?    1.59
 2     6.87    3.09    3.11    3.22    3.06    3.01
 3     3.62    1.51    1.41    1.57    1.79    1.79
 4    15.82    8.26    8.27    7.82    7.37    7.39
 5     6.64    3.49    3.42    3.50    3.69    3.70
 6     5.39    2.52    2.38    2.44    2.75    2.87
 7     8.65    4.26    4.33    4.31    3.76    3.74
 8     6.98    3.49    3.40    3.42    3.97    4.07
 9     5.93    2.43    2.61    2.42    2.73    2.89
10     7.45    2.60    2.51    2.90    3.03    2.92
11    14.06    7.52    7.52    7.52    7.18    7.12
12     1.22    0.60    0.60    0.60    1.52    1.52
13     5.75    2.29    2.22    2.24    2.14    2.12
14    12.35    4.67    4.75    4.75    4.24    4.15
15     2.93    1.36    1.27    1.37    1.63    1.62
16     2.46    0.94    1.00    1.00    1.42    1.42
17    20.23    5.32    5.47    5.67    6.04    5.99
18     9.41    4.98    4.94    4.98    3.58    3.54
19     4.75    1.93    2.16    2.14    2.08    2.01
20     2.73    1.24    1.16    1.21    1.70    1.72
AR     5.90    2.65    2.55    3.20    3.50    3.20

Table 10: Results for the squared reject loss $\ell^{S}_{\alpha,\rho}$ for the datasets in Table 12, using the experimental
methodology in section 2.5. Each row aggregates the folds and five different values for $\alpha$ with
$\alpha \in \{0, 0.25, 0.5, 0.75, 1\}$ and ten different values for $\rho$ with $r \in \{0, 0.111, 0.222, \ldots, 1\}$ and $\rho = \frac{1}{2}\sigma(D_Y)\frac{r}{1-r}$
(totalling 50 variations per fold). For visibility all the losses are multiplied by 10. Each section of six
columns shows results for different base techniques (LR, kNN and Tree). The average ranks (AR) are
calculated for these three groups separately. The Friedman statistics for the three sections are 48.4, 54.03 and
43.23, respectively, which are greater than the critical value (12.57). This means that the null hypothesis
is rejected (significance level: 0.05) and the methods do not perform equally. Differences in average
ranks higher than the critical difference for the Nemenyi post-hoc test (0.5217) imply that the difference is
significant (in bold).



has several advantages: enrichment methods are easily applicable to any regression method, they only 
require the actual output values of a training (or validation) dataset, and the two-parameter normal 
density estimator does not need to be recalculated whenever the loss function changes. 

• We develop new straightforward enrichment methods which show good performance as conditional 
density estimators. 

• We show that local reframing has a broad range of applicability over many different problems. We 
illustrate its effectiveness on three families of problems (bid applications, asymmetric loss applications 
and rejection rule applications) by the theoretical derivation of the expression that leads to optimal 
reframing in each case and a thorough empirical validation against other approaches, such as global 
reframing. 

These contributions are important for machine learning because regression, unlike classification, has lacked 
a comprehensive and effective approach to deal with cost-sensitive problems by the reuse (and not a re- 
training) of general regression models. 

7.2 Alternative approaches and other applications

The experimental results shown in previous sections could be even better for probabilistic reframing if we were
able to get better enrichment methods (or other normal conditional density estimation methods). In fact, we
envisage intensive work along this line, similar to what was done in the last decade for probability estimation
in classification, where many classification methods were rethought and redesigned to get good probabilities
or good rankings (e.g., probability estimation trees [54, 26, 24]). Similarly, important progress was made
in calibration methods [70, 3]. Other possibilities for soft regression models could be conceived, leading to
possibly simpler minimisation solutions (e.g., triangular or uniform distributions). Also, distributions with
more parameters (e.g., asymmetric normal, truncated normal or Levy distributions) could be explored.

The applicability of the enrichment and reframing methods for a diversity of base regression techniques 
(from non-parametric regression trees and kNN to parametric LR) with very different mathematical and 
statistical properties has led us to validate the approaches experimentally. However, the definition of specific 
combinations of a base technique with a particular enrichment method (e.g., kNN with uKNC, or LR with 
ENR — LR) could be analysed theoretically to derive statistical properties that may characterise their general 
behaviour better. 

One important element in the development and improvement of soft and probabilistic regression models
is the use of appropriate evaluation metrics and graphical representations, prior to any specific
context-sensitive application, as has been done here in section 3 (before sections 4, 5 and 6). We have used msll and
msvr here, but other (new) metrics could be used, inspired by classification metrics [29] and plots such as
ROC curves, cost curves [17], Brier curves [41], calibration plots, etc., and their derived measures, such as
AUC [33, 27].

This paper has only included some representative applications, by choosing some common loss func- 
tions. There are, of course, many other domains and possible loss functions. For instance, tolerance is a 
concept that has been frequently used to bring ideas from classification to regression, since a tolerance level 
can be used to classify estimations as 'correct' or 'incorrect'. An example of a general tolerance loss can be 
defined as follows, by considering asymmetric losses and asymmetric tolerance levels for overestimations 
and underestimations: 

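Definition 11. The tolerance loss $\ell^{T}_{\alpha,\tau^-,\tau^+}$ is a loss function defined as follows:

$$\ell^{T}_{\alpha,\tau^-,\tau^+}(\hat{y}, y) = \begin{cases} \alpha & \text{if } \hat{y} + \tau^- < y \\ 1-\alpha & \text{if } \hat{y} - \tau^+ > y \\ 0 & \text{otherwise} \end{cases}$$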
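
To illustrate how probabilistic reframing could handle this loss, the following Python sketch reads Definition 11 as a cost of $\alpha$ when the prediction falls short of the true value by more than a lower tolerance, $1-\alpha$ when it exceeds it by more than an upper tolerance, and 0 otherwise. Under a normal conditional distribution the expected loss is then a weighted sum of two tail probabilities. The function and parameter names (`tau_minus`, `tau_plus`) are assumptions made for the example.

```python
from statistics import NormalDist


def tolerance_loss(y_hat, y, alpha, tau_minus, tau_plus):
    """Tolerance loss under the reading of Definition 11 described above."""
    if y_hat + tau_minus < y:
        return alpha          # underestimation beyond the tolerance
    if y_hat - tau_plus > y:
        return 1.0 - alpha    # overestimation beyond the tolerance
    return 0.0                # prediction within tolerance


def expected_tolerance_loss(t, mu, sigma, alpha, tau_minus, tau_plus):
    """Expected tolerance loss of predicting t when y ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    p_under = 1.0 - dist.cdf(t + tau_minus)  # P(y > t + tau_minus)
    p_over = dist.cdf(t - tau_plus)          # P(y < t - tau_plus)
    return alpha * p_under + (1.0 - alpha) * p_over
```

Minimising `expected_tolerance_loss` over `t` for each example would give the locally reframed prediction for this loss.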

A related loss could originate from ordinal prediction if we define a loss function in such a way that
it is 0 if the prediction is inside the correct bin of the discretisation (e.g., low (0..3), mid (3..7), high (7..10)), and,
say, the number of bins it has to cross to reach the right bin otherwise. It would be interesting to see how
probabilistic reframing could work in these two cases.

Apart from the loss functions which relate the true value $y$ and the estimated value $\hat{y}$, there are many other
kinds of costs and contexts [66]. For instance, loss functions can be instance-dependent, such as those that
are a function of the input values, represented as $\ell(\hat{y}, y, x)$. It is important to note again that this invalidates
global reframing methods (but not local reframing methods). More generally, we can even have a relevance
or prior distribution $U(x)$. This can be addressed by giving more relevance to some examples than others, in
the integration (or sum) of the overall estimated cost ($\sum U(x)\,\ell(\hat{y}, y)$) or in graphical representations, such as
ROCIV (instance-varying ROC curves, [23]). In other cases, this relevance function can be more complex,
as in the so-called utility-based regression [64, 56]. One way or another, it is important to realise that the
methods in this paper are applicable when there is a change of the prior distribution, since the minimisation
of the loss is local to each example (the experiments in this paper used 2-fold cross-validation without
shuffling to simulate this situation). Changing the output distribution (or relevance) does not change the
methods, since each optimisation is independent from the rest.

Finally, a soft regression model (issuing probabilities, reliabilities or confidence intervals) can be useful
for some tasks, such as quantification [34] for regression, in the same way it has been shown beneficial
for quantification for classification [6]. Screening applications can also take advantage of enrichment
methods, since some elements in a rank could be considered a tie if their conditional distributions overlap
to a certain degree (possibly determining this with a test over the two normals, such as a KS-statistic). This
can be applied to preference learning, where we can answer not only whether for two given examples $x_1$ and
$x_2$ we have that $y_1 > y_2$, but we can also calculate the probability $Prob(y_1 > y_2)$ (if the regression model
is probabilistic). This, for instance, suggests an evaluation metric related to the Wilcoxon-Mann-Whitney
statistic interpretation of the AUC (area under the ROC curve), simply as $Prob(\hat{y}_1 > \hat{y}_2 \mid y_1 > y_2)$.
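
For two normal conditional distributions the preference probability has a simple closed form. The sketch below assumes that the two outputs are independent given their inputs, so that their difference is normal with mean $\mu_1 - \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def prob_greater(mu1, sigma1, mu2, sigma2):
    """P(y1 > y2) for independent y1 ~ N(mu1, sigma1^2), y2 ~ N(mu2, sigma2^2)."""
    spread = math.hypot(sigma1, sigma2)  # standard deviation of y1 - y2
    return NormalDist().cdf((mu1 - mu2) / spread)
```

Two examples could then be treated as a tie when this probability is close to 0.5, which is one way of making the overlap criterion mentioned above concrete.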

7.3 Concluding remarks 

The goal of the paper was to show that cost-sensitive applications in regression can be successfully handled 
by a probabilistic reframing using enriched regression models in the form of a two-parameter normal con- 
ditional distribution. In order to accomplish this goal we needed to compare enrichment methods to other 
approaches for conditional density estimation in terms of estimation quality and efficiency. Another impor- 
tant issue that we needed to consider is the simplicity of the expressions leading to the optimal reframing 
with minimum expected loss (minimum risk). The choice of a normal distribution consummates all this, and 
consolidates a view of regression as a two-parameter estimation problem: conditional mean and variance. 
Also, we have seen that we can enrich any existing regression technique with reasonably good variance
estimations, using some existing techniques and, most especially, some novel enrichment methods that are 
extremely lightweight. Enrichment methods in regression are somewhat similar to calibration methods in 
classification. However, the key difference is that the original prediction is kept and complemented by a 
second parameter, the variance. 

Other approaches for context-sensitive applications build a model which is specialised for a very specific
context, embedding the context in the model. In reframing, we reuse a general model for a wide range of
operating contexts. The philosophy is completely different: models can be reused and validated
across different operating contexts, improving robustness and efficiency.

Local reframing uses information about each prediction (reliability, confidence or probability) to adapt 
each local prediction. When we have probabilities (from the use of a conditional density function, e.g., a 
normal distribution), we can solve the decision rules analytically and, in cases where a closed form cannot 
be derived, use simple numerical approximations. In fact, probabilistic reframing only needs to derive 
the conditional variance once and for all, either from the method itself (e.g., regression trees derive this 
variance as the variance in each leaf of the tree) or by the use of enrichment methods, which only require 
the comparison to the output value. Once a regression model is equipped with a good conditional variance 
estimation we can apply the model to a variety of problems. Moreover, we can even use a different loss 
function (or different loss parameters) for each individual example. 
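
The local decision described in this paragraph can be sketched for a single example as follows. The sketch assumes an asymmetric absolute loss of the form $\alpha|t-y|$ for underestimates and $(1-\alpha)|t-y|$ for overestimates (the scaling in the paper's definition may differ), and it uses a golden-section search as a stand-in for the "simple numerical approximations" mentioned above. Under this assumed loss the optimum is also available in closed form as the $\alpha$-quantile $\mu + \sigma\Phi^{-1}(\alpha)$, which serves as a check. All names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def expected_asym_abs_loss(t, mu, sigma, alpha):
    """Expected asymmetric absolute loss of predicting t when y ~ N(mu, sigma^2)."""
    z = (t - mu) / sigma
    under = STD.pdf(z) - z * (1.0 - STD.cdf(z))  # E[(Z - z)^+], underestimation
    over = z * STD.cdf(z) + STD.pdf(z)           # E[(z - Z)^+], overestimation
    return sigma * (alpha * under + (1.0 - alpha) * over)


def reframe(mu, sigma, expected_loss, tol=1e-8):
    """Golden-section search for the t that minimises a unimodal expected_loss(t)."""
    lo, hi = mu - 8.0 * sigma, mu + 8.0 * sigma
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if expected_loss(a) < expected_loss(b):
            hi = b   # the minimum lies in [lo, b]
        else:
            lo = a   # the minimum lies in [a, hi]
    return 0.5 * (lo + hi)


# One example with its own (mu, sigma) and its own loss parameter alpha.
mu, sigma, alpha = 10.0, 2.0, 0.8
t_numeric = reframe(mu, sigma, lambda t: expected_asym_abs_loss(t, mu, sigma, alpha))
t_closed = mu + sigma * STD.inv_cdf(alpha)
```

Because the search is run per example, a different `alpha` (or a different loss function altogether) can be passed for each instance without retraining the underlying model.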

Global reframing, on the other hand, tries to infer one global function from the training set which is 
applied to all the examples. This implies an optimisation procedure over the whole training set whenever 
the loss function (or any of its parameters) changes. Also, except for some convex loss functions where 
some efficient numerical methods can be used, this procedure may be time-consuming. These differences 
and the fact that the experimental results are, in general, favourable to probabilistic reframing, suggest that 
an instance-based (local) approach is the best option for reframing. It also links much better to the areas of 
risk minimisation in decision theory. 

Overall, this paper contains a number of contributions and integrates a wide range of techniques that
should trigger further research on conditional variance estimation, enrichment methods, calibration techniques
for regression, evaluation metrics for regression, and better reframing techniques on these and other
context-sensitive applications of regression models.

A Datasets

Datasets, shown in tables 11 and 12, are obtained from eight packages of the CRAN distribution of the R
project [55], namely: 'class', 'boot', 'MASS', 'nlme', 'lattice', 'np', 'survival' and 'faraway'. Some of
them have been processed to eliminate redundant or null attributes. All the datasets and the scripts in R (for
the methods and tests) are available at http://www.dsic.upv.es/~jorallo/reframe-reg/scripts-data.zip.





 #  name             size  attr  TrTeMD  TrTeKS
 1  seatbelts         192     7    1.22    0.53
 2  theoph            132     4    0.10    0.11
 3  USjudgeratings     44    12    0.24    0.25
 4  cars               50     2    1.72    0.74
 5  faithful          272     2    0.02    0.07
 6  boston            506    14    0.40    0.24
 7  UScrime            48    16    0.32    0.19
 8  gilgais           364     9    0.20    0.10
 9  wtloss             52     2    3.57    1.00
10  cefamandole        84     3    0.37    0.14
11  dialyzer          140     4    0.42    0.47
12  earthquake        182     5    0.31    0.18
13  gasoline           32     6    2.09    0.75
14  glucose           376     4    0.07    0.10
15  IGF               236     3    0.03    0.11
16  nitrendipene       88     4    0.09    0.14
17  wheat              48     4    0.74    0.42
18  environmental     112     4    0.28    0.20
19  wage1             526    21    0.23    0.12
20  ozone             330    10    0.28    0.20

Table 11: Dataset battery used in section 3 and related appendices. We show the size, the number of
attributes, the relative difference in means (of the output value) between train and test (TrTeMD) and the
Kolmogorov-Smirnov statistic between train and test (TrTeKS).

 #  name         size  attr  TrTeMD  TrTeKS
 1  iris3         150     4    2.48    0.70
 2  savings        50     5    0.22    0.20
 3  USarrests      50     4    0.38    0.20
 4  rock           48     4    7.73    0.83
 5  trees          32     3    3.37    0.94
 6  salinity       28     4    0.16    0.43
 7  birthwt       188    10    0.67    0.63
 8  menarche       24     3    4.97    1.00
 9  road           26     6    0.09    0.46
10  stormer        24     3    0.32    0.25
11  bodyweight    176     4   11.09    1.00
12  oxboys        234     3    0.10    0.09
13  oecdpanel     616     7    0.58    0.26
14  lungcancer    168    10    0.67    0.33
15  Chicago        48     7    0.34    0.17
16  diabetes      402     3    0.11    0.07
17  divusa         76     7    0.54    0.55
18  exa           256     3    0.40    0.47
19  prostate       96     9    1.33    0.50
20  seatpos        38     9    0.28    0.26

Table 12: Dataset battery used in sections 4, 5 and 6. We show the size, the number of attributes, the relative
difference in means (of the output value) between train and test (TrTeMD) and the Kolmogorov-Smirnov
statistic between train and test (TrTeKS).



B Evaluation metrics for conditional density estimators 

We present three metrics for evaluating the conditional mean, the conditional variance and the conditional
density of a soft regression model. Since we need to work with any possible base regression model we
present general measures instead of technique-specific measures for particular goodness-of-fit, parameter
estimation or intrinsic variance estimation. In order to make results more commensurate and easier to
compare, for all the measures which are not in the interval [0, 1], we will apply the logistic function
$\Lambda(t) = \frac{1}{1+e^{-t}}$. We will use the word 'standardised' to refer to this logistic normalisation. In some cases, we will
apply the function $1 - t$ or other transformations to always get a decreasing [0, 1] scale (0 for very good
estimations and 1 for very bad estimations).

The evaluation of the conditional mean or expected value is usually measured by the mean squared error
$\frac{1}{|D|}\sum_{(x,y)\in D}(y - m(x))^2$ over a dataset $D$, although other metrics are also common, such as the mean absolute
error, several correlation indices, mean relative squared error, etc. We will use the mean relative squared
error (mrse) to make the measure less dependent on the dataset and easy to compare with the constant
(trivial) regression model (the model which always outputs the mean of the training dataset, $\mu(D_Y)$):



The factor $\ln 3$ and the linear transformation mean that we get 0.5 if the error is the same as that of the constant
(trivial) model, 0 for a perfect regression model and close to 1 for very bad estimations.
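
One form of mrse that is consistent with this description can be sketched in Python. This is an assumption, not a transcription of the paper's equation: it takes the ratio between the squared error of the model and that of the constant model, multiplies it by $\ln 3$, applies the logistic function and rescales linearly with $2\Lambda(\cdot) - 1$. The function names are hypothetical.

```python
import math


def logistic(t):
    """Logistic function 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def mrse(y_true, y_pred, train_mean):
    """Assumed form of the mean relative squared error, mapped to [0, 1).

    Requires the constant model to have a non-zero squared error.
    """
    sse_model = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sse_const = sum((y - train_mean) ** 2 for y in y_true)
    ratio = sse_model / sse_const
    # ratio 0 -> 0, ratio 1 -> 2 * logistic(ln 3) - 1 = 0.5, ratio -> inf -> 1
    return 2.0 * logistic(math.log(3.0) * ratio) - 1.0
```

The three reference points quoted in the text (0 for a perfect model, 0.5 for a model as good as the constant one, values close to 1 for very bad models) all hold for this form.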

A more complex issue is to evaluate the quality of conditional density estimators. One possibility is
the mean squared error for the distributions, i.e. $\int (\hat{f}(y|x) - f(y|x))^2$, but this cannot be properly seen as an
evaluation metric, since $f(y|x)$ is usually unknown or is pointwise infinite if we calculate this for a given
dataset. This also happens for other distribution divergences, such as the KL-divergence. Consequently, one
common measure is the mean negative log-likelihood (nll), which is defined as $\frac{1}{|D|}\sum_{(x,y)\in D} -\ln(\hat{f}(y|x))$. We
will again use the logistic function, applied to the log-likelihood, i.e.:
$v = \Lambda(\ln(\hat{f}(y|x))) = \frac{1}{1+e^{-\ln \hat{f}(y|x)}} = \frac{1}{1+\frac{1}{\hat{f}(y|x)}}$.
From here, we just switch to $1 - v$ to get 0 for very good estimations and 1 for very bad estimations,
and derive the mean standardised likelihood (msll), as follows.

$$msll(\hat{f}, D) = \frac{1}{|D|}\sum_{(x,y)\in D}\left(1 - \frac{1}{1+\frac{1}{\hat{f}(y|x)}}\right) \qquad (14)$$



The log-likelihood (or its logistic variant) evaluates, at the same time, the quality of the mean and the
quality of the variance. If we want a measure of the latter only, a possibility might be the squared error
between the residual and the standard deviation. However, this measure also depends on how well the
means are estimated, since when mean estimations are accurate (respectively inaccurate) the residuals are
low (respectively high), and variances would tend to be low (respectively high) as well. An alternative is to
calculate the variance ratio versus the squared residual. If we denote the residual as $res_{\hat{f}}(x, y) = (\hat{\mu}(x) - y)$,
we can define the variance ratio as $vr_{\hat{f}}(x, y) = \frac{\hat{\sigma}^2(x)}{res_{\hat{f}}(x,y)^2}$. The numerator is the estimated variance and the
denominator is the squared residual. This ratio will be close to 1 if both quantities are similar. If both the
numerator and the denominator are 0, $vr_{\hat{f}}(x, y) = 1$ by definition. From here, the mean standardised variance
ratio is given by the logistic function of the log ratio:

$$msvr(\hat{f}, D) = \frac{1}{|D|}\sum_{(x,y)\in D}\left|1 - 2\Lambda(\ln(vr_{\hat{f}}(x,y)))\right| = \frac{1}{|D|}\sum_{(x,y)\in D}\left|1 - \frac{2}{1+vr_{\hat{f}}(x,y)^{-1}}\right| \qquad (15)$$



This measure is always between 0 and 1, with 0 being a perfect variance estimation (variance is always
equal to the squared residual) and 1 being the worst variance estimation (variance being much higher or
much lower than the squared residual).
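
As a minimal sketch, the two measures can be computed in Python for a Gaussian soft regression model that outputs a mean and a standard deviation per example. The sketch uses the simplifications $1 - \frac{1}{1 + 1/f} = \frac{1}{1+f}$ for msll and $|1 - 2\Lambda(\ln vr)| = \frac{|\sigma^2 - res^2|}{\sigma^2 + res^2}$ for msvr. The absolute value follows the reading that both too high and too low variances are penalised, as the paragraph above states. The function names and argument layout are assumptions.

```python
from statistics import NormalDist


def msll(y_true, mus, sigmas):
    """Mean standardised likelihood for Gaussian predictions (sigma > 0)."""
    total = 0.0
    for y, mu, sigma in zip(y_true, mus, sigmas):
        density = NormalDist(mu, sigma).pdf(y)
        total += 1.0 / (1.0 + density)  # equals 1 - 1 / (1 + 1 / density)
    return total / len(y_true)


def msvr(y_true, mus, sigmas):
    """Mean standardised variance ratio for Gaussian predictions."""
    total = 0.0
    for y, mu, sigma in zip(y_true, mus, sigmas):
        var, sq_res = sigma ** 2, (mu - y) ** 2
        if var == 0.0 and sq_res == 0.0:
            continue  # vr = 1 by definition, which contributes 0
        total += abs(var - sq_res) / (var + sq_res)
    return total / len(y_true)
```

Both functions return values in [0, 1], with lower values indicating better estimations.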



C Conditional density estimation methods 

It can be argued that if we want to obtain a conditional density estimation, we should use conditional density
estimation techniques, instead of crisp regression methods. Conditional density estimation techniques [44]
are methods which directly⁴ obtain $f(y|x)$. While this is the most general and informative way for the
regression problem (since conditional means, variances, confidence intervals and other measures can be
obtained from it), the techniques are usually slower and suffer from a number of restrictions. A general way
to tackle this estimation is through non-parametric methods (see, e.g., [43]). For instance, many approaches
are restricted to only one or two input variables (such as R's hdrcde package [44]), or just calculate
multivariate densities, which have to be normalised for each input value $x$ to get a univariate density.

It is not the goal of this paper to evaluate several of the approaches for density estimation methods, but 
it is important to see whether these methods are better, in general, than 'augmented' or 'enriched' methods 
for which just a conditional mean and variance are obtained (from which a simple Gaussian density function 
is estimated). In addition, we are interested in the result of 'reducing' a local and detailed density function 
into a Gaussian. In order to do all this, we illustrate this approach with a kernel-based (non-parametric)
conditional density method. We use the function npcdens in R's np (non-parametric) package. This
function computes estimates for the density function (i.e., $f(y|x)$), for a bandwidth specification using the
method in [39]. From the density we calculate the conditional mean and the conditional variance as a
pointwise average which approximates the integral of the expected value and the moment respectively. We
used the median point and the four points at $\pm\sigma$ and $\pm 2\sigma$, given by a Gaussian.

⁴ Some other methods calculate the joint distribution $f(y, x)$ or the likelihood $f(x|y)$, from which the conditional density is just
derived by dividing by $f(x)$ or applying Bayes theorem respectively. However, this is usually more complex than the original
problem.

Table 13: Results (using the datasets in Table 11) for the kernel-based (non-parametric) conditional density
method given by the function npcdens in R's np (non-parametric) package. The first three columns
show the results for the parametric density estimation (using the Gaussian approximation). The column
"CDE orig msll" shows the msll result by using the original non-parametric density function. The six
rightmost columns show the results for the mean given by the base method with the variance estimation
given by the conditional density method (using the Gaussian approximation). Results for mrse are not
shown for the six rightmost columns since they are equal to those in Table 2.



Table 13 shows the results of this method. Non-parametric conditional density estimation methods can
get good estimations for large datasets with complex densities (e.g., bimodal) but here we see that the results
are, in general, worse than those of simple regression methods such as kNN or Tree for the conditional
mean, as shown in Table 2. In fact, while the conditional variances seem better (0.53 versus 0.70, 0.56
and 0.58 in Table 2), the conditional densities are not better (except for LR) and the squared error (mrse)
is also worse. A possible idea is then to combine the good conditional means from the base techniques
with the conditional variance from the conditional density estimation method. This is what the last six
columns show. However, as expected, this does not increase the quality of the conditional densities, because
conditional variance estimates must be linked to a mean estimation. Consequently, neither as a standalone
method nor combined with the base technique can we get better performance. Also, this method is about two
orders of magnitude slower than the direct methods in subsection 3.1 (and many other methods that we see
in the rest of section 3).

D Conditional variance estimation methods 

Instead of deriving a full conditional density, we can just (re-)use the conditional mean of a classical (crisp)
regression method and derive a conditional variance. A common, though not widely known, way of estimating
the conditional variance is as follows (steps 1 and 2 can be omitted if we already have a regression
model):

Definition 12. Given a training or validation set $T$, and a (test) instance $x$, the two-step conditional variance
estimation method 2SCVE is defined as follows:

1. Train a regression model $m_y$ using $T$.

2. Obtain $\hat{y}_i \leftarrow m_y(x_i)$ for each example $\langle x_i, y_i \rangle \in T$.

3. Calculate the residuals: $u_i \leftarrow (y_i - \hat{y}_i)$.

4. Apply a transformation function $\theta$: $v_i \leftarrow \theta(u_i)$.

5. Train a regression model $m_v$ for the dataset $H = \{\langle x_i, v_i \rangle\}$.

6. Obtain $\hat{y} = m_y(x)$ and $\hat{v} = m_v(x)$ for the example $x$ to be predicted (in the test set).

This estimates the conditional mean as $\hat{\mu}(x) = \hat{y}$ and the conditional standard deviation as $\hat{\sigma}(x) = \theta^{-1}(\hat{v})$
(for $\theta(t) = t^2$ this is $\hat{\sigma}(x) = \sqrt{\hat{v}}$).
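The six steps of Definition 12 can be sketched in a few lines. The following is only a minimal illustration, not the implementation used in the experiments: it assumes one-dimensional inputs, a least-squares line for $m_y$, a k-nearest-neighbour average for $m_v$ and $\theta(t) = t^2$. The function names (`fit_line`, `knn_mean`, `two_step_cve`) and the toy data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (1-D inputs); returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def knn_mean(xs, vs, x, k):
    """Average of the targets vs over the k training points closest to x."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(vs[i] for i in nearest) / len(nearest)

def two_step_cve(xs, ys, x, k=3):
    """Return (mean, std) for instance x, following steps 1-6 with theta(t) = t**2."""
    m_y = fit_line(xs, ys)                        # step 1
    y_hat = [m_y(xi) for xi in xs]                # step 2
    u = [yi - yh for yi, yh in zip(ys, y_hat)]    # step 3: residuals
    v = [ui ** 2 for ui in u]                     # step 4: theta(u) = u^2
    v_x = knn_mean(xs, v, x, k)                   # steps 5-6: kNN plays the role of m_v
    return m_y(x), math.sqrt(v_x)                 # invert theta to obtain a std

# Toy heteroscedastic data (made up for illustration): the noise grows with x.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0.0, 1.1, 1.9, 3.2, 3.7, 5.5, 5.2, 8.0, 6.5, 10.6]
mean_lo, std_lo = two_step_cve(xs, ys, 1.0)
mean_hi, std_hi = two_step_cve(xs, ys, 8.0)
```

On this toy data the estimated standard deviation is larger at x = 8 than at x = 1, which is the kind of instance-dependent variance the method is meant to capture.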

Usual choices for $\theta(t)$ are $\theta(t) = t^2$ and $\theta(t) = \ln(t^2)$, i.e., we model the (logarithm of the) squared residuals
[69, 67]. The square is usually included since these methods are aimed at estimating the variance and also
because otherwise we would need to remove the sign.

The estimation will depend on the quality of the regression model $m_y$ and most especially on the second
regression model $m_v$. In fact, the previous algorithm (steps 1 to 5) is usually iterated by retraining the
regression model $m_y(x)$ using the heteroscedasticity information given by the most recently estimated conditional
variance. This information can only be used by some regression techniques, e.g., weighted least squares
with inverse-variance weights. This is usually called iteratively re-weighted least squares.
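As an illustration of this iteration, the sketch below alternates between a weighted least-squares line and a kNN estimate of the residual variance, using inverse-variance weights. It is a simplified example under stated assumptions (one-dimensional inputs, a fixed number of rounds, a small variance floor to avoid division by zero); the names `weighted_line` and `irls` are hypothetical and this is not the procedure evaluated in the paper.

```python
def weighted_line(xs, ys, ws):
    """Weighted least squares for y = a + b*x; ws are positive weights."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def irls(xs, ys, k=3, rounds=5, floor=1e-6):
    """Alternate between fitting the mean and re-estimating the variance."""
    ws = [1.0] * len(xs)                      # first round: ordinary least squares
    for _ in range(rounds):
        a, b = weighted_line(xs, ys, ws)
        sq = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
        var = []
        for x in xs:
            near = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
            # kNN mean of squared residuals, floored to keep the weights finite
            var.append(max(floor, sum(sq[i] for i in near) / len(near)))
        ws = [1.0 / v for v in var]           # inverse-variance weights
    return a, b, var

a, b, var = irls([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

The variance floor and the fixed number of rounds are design choices of this sketch; a convergence test on the coefficients would be a natural alternative.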

At this point it is important to notice that we really estimate the variance of the residuals of our model
conditional on $x$, not the variance of $y$ conditional on $x$. These two variances are equal only if $m_y$ is a
perfect regression model.

It is usual to apply a non-parametric model in step 5. For instance, if we use nearest neighbours, the
previous algorithm boils down to estimating $\hat{\sigma}^2(x)$ as the mean of the squared residuals of the $k$ closest
examples, which is similar to what we did for kNN in section 3.1.

Here we will explore the kNN and Tree techniques for the residual model $m_v(x)$, jointly with the three
base techniques LR, kNN and Tree as usual. Table 14 shows some of these methods for $\theta(t) = t^2$ (we ran
the same experiments with other configurations of $\theta$ with equal or worse results). The results only show an
improvement for LR compared to the results in Table 2. For kNN or Tree, the results are worse than the
results in Table 2.

E Conditional variance estimation based on reliability 

A reliability measure for regression is any numerical value which is directly related to the degree of certainty 
about an accurate prediction being produced or, more precisely, inversely related to the expected (absolute) 
residual. However, the magnitude of this value can follow any scale. Bosnic & Kononenko [8] compare 
several reliability estimators for regression. Among them, CNK is a simple method which shows good
performance (as a reliability estimator). This method works as follows. For each example $(x, y)$, it
finds the $k$ closest elements to $x$ in the training set and calculates the mean of their output values,
denoted by $C$; in other words, it calculates the kNN prediction for $x$. Then it calculates the absolute
difference between $C$ and the prediction (presumably of a regression technique other than kNN). This is the
estimated standard deviation.

Table 14: Results (using the datasets in Table 11) for several base methods (LR, kNN and Tree) with
conditional variance estimation using kNN and Tree as models for the residuals. All the methods use
$\theta(t) = t^2$. The results for mrse are not shown since they are equal to those in Table 2.

The previous approach averages the true values and then compares this average to a single estimation.
While this is good as a reliability measure, the magnitude of the estimation will typically be low (for a
standard deviation), since it just compares the predictions of two methods. Consequently, we suggest a
correction, which goes as follows:

1. Given an example $(x, y)$, we estimate $\hat{y}$ by any base regression technique.

2. Let $S = \{\langle x_i, y_i \rangle\}$ be the set of the $k$ nearest neighbours of $x$ in a training or validation dataset.

3. Calculate $\frac{1}{k} \sum_{\langle x_i, y_i \rangle \in S} (\hat{y} - y_i)^2$ as the output estimated variance.

Since it can be seen as a symmetric version of the CNK method, we call it KNC. The results are shown in Table
15. As we can see, CNK is not a good variance estimation method in general. In addition, it cannot work,
by definition, for kNN, since this method uses the kNN prediction as the true value, and the estimated 'residuals' will be
0. This can be clearly seen in the two columns "kNN CNK". In contrast, KNC works well. In fact, it
improves the results for the LR base technique shown in Table 2.
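The difference between the two estimators can be seen in a small sketch. It assumes one-dimensional inputs and a base prediction supplied by the caller; the helper names (`k_nearest`, `cnk_std`, `knc_var`) and the toy data are hypothetical and only meant to illustrate the two definitions above.

```python
def k_nearest(train, x, k):
    """The k training pairs (xi, yi) whose input is closest to x (1-D inputs)."""
    return sorted(train, key=lambda p: abs(p[0] - x))[:k]

def cnk_std(train, x, y_hat, k):
    """CNK as described above: |kNN prediction - base prediction|."""
    neigh = k_nearest(train, x, k)
    c = sum(y for _, y in neigh) / len(neigh)
    return abs(c - y_hat)

def knc_var(train, x, y_hat, k):
    """KNC: mean squared difference between the base prediction and each neighbour's output."""
    neigh = k_nearest(train, x, k)
    return sum((y_hat - y) ** 2 for _, y in neigh) / len(neigh)

train = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0), (3.0, 6.0), (4.0, 4.0)]
x, k = 1.0, 3
y_hat = 2.0          # hypothetical base-model prediction at x
cnk = cnk_std(train, x, y_hat, k)
knc = knc_var(train, x, y_hat, k)
```

Here the base prediction happens to coincide with the kNN prediction, so CNK returns 0 even though the neighbours' outputs are spread around it, whereas KNC returns 2/3. This is the same effect that makes CNK degenerate when the base technique is kNN itself.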

F Conditional variance estimation using conformal prediction 

Confidence intervals are an alternative (and statistically convenient) way of measuring the reliability of a 
prediction. Conformal prediction [60, 50] is a general technique for deriving confidence intervals. It can 
be applied to any predictive task, such as classification and regression. While originally introduced for 
a transductive scenario, it has also been extended to a more classical inductive setting [49]. Conformal 
prediction works as follows. Given an error probability $\epsilon$ and any regression method that makes a prediction
$\hat{y}$, it produces a region $\Gamma^{\epsilon}$ that contains the true value $y$ in at least a proportion $1 - \epsilon$ of the cases (the
confidence level). Logically, by making the region infinitely large, we can get any confidence level. The key
issue is that the tightness, and therefore the usefulness, of the prediction region depends on the nonconformity
measure used. A nonconformity measure is any measure that evaluates how unusual an example is (with
respect to the others). In regression, for instance, a typical nonconformity measure is the absolute difference
$|y - \hat{y}|$.

Table 15: Results (using the datasets in Table 11) for several base techniques (LR, kNN and Tree) with
Bosnic & Kononenko's CNK [8] and a more accurate variation that we dub KNC.



The idea of outputting a confidence region is richer and more informative than only a prediction point 
(which in regression is just the conditional mean). One advantage of confidence regions is that there is 
no assumption about the conditional distribution. However, this is one of its drawbacks for cost-sensitive 
learning, because we cannot quantify the probability of error in the prediction (see footnote 5).

One way of deriving a density function from an interval is by assuming a distribution. Since we advocate
the normal distribution for context-sensitive applications, we can devise a simple method by assuming
this distribution. In particular, for a normal distribution (with cumulative distribution $\Phi_{\mu,\sigma^2}$) we know that the
proportion $p$ of the values inside $\mu \pm a\sigma$ is given by $p = \Phi_{\mu,\sigma^2}(\mu + a\sigma) - \Phi_{\mu,\sigma^2}(\mu - a\sigma)$. For a conformal
region $\Gamma^{\epsilon}$ we know that a proportion $1 - \epsilon$ of the values fall inside the region. By taking different values
for $a$ we can get different points where we can derive the correspondence. For instance, for $a = 1$, we get
$p = 0.6827$. Setting $\epsilon = 1 - 0.6827$ we then calculate the conformal region $\Gamma^{0.3173}$. The width of this region,
denoted by $width(\Gamma^{0.3173})$, has to be $2a\sigma$. Since we chose $a = 1$, we have that:

$$\sigma = \tfrac{1}{2}\, width(\Gamma^{0.3173}) \qquad (16)$$
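The correspondence between a conformal region and a standard deviation can be sketched as follows. This is a simplified illustration, not the implementation of [51]: it assumes the plain nonconformity measure $|y - \hat{y}|$, a separate calibration set of residuals, and the common convention of taking the $\lceil (1-\epsilon)(n+1) \rceil$-th smallest calibration score (clipped to $n$ for small sets). The function names and the calibration residuals are hypothetical.

```python
import math
from statistics import NormalDist

def coverage(a):
    """Probability mass of a normal distribution inside mu +/- a*sigma."""
    std = NormalDist()
    return std.cdf(a) - std.cdf(-a)

def icp_half_width(cal_residuals, eps):
    """Half-width of an inductive conformal region for nonconformity |y - y_hat|.

    Assumed convention: the ceil((1 - eps) * (n + 1))-th smallest calibration score.
    """
    scores = sorted(abs(r) for r in cal_residuals)
    rank = math.ceil((1 - eps) * (len(scores) + 1))
    rank = min(rank, len(scores))     # simplification: clip for small calibration sets
    return scores[rank - 1]

def sigma_from_region(width, a=1.0):
    """Equation (16) for a = 1; in general width = 2 * a * sigma."""
    return width / (2 * a)

eps = 1 - coverage(1.0)               # about 0.3173
cal_residuals = [-1.2, 0.3, 0.8, -0.5, 2.0, -0.1, 0.9, -1.5, 0.4]   # made-up values
half = icp_half_width(cal_residuals, eps)
sigma = sigma_from_region(2 * half)
```

With this particular nonconformity measure the half-width, and hence the derived standard deviation, is the same for every test instance; instance-dependent widths require normalised nonconformity measures such as the other six mentioned in the text.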

We have implemented the inductive conformal regression presented in [51]. It presents seven nonconformity
measures. The first one, which we denote by A, is just $|y - \hat{y}|$. The other six are modifications which
take some metrics of the $k$-nearest neighbours into account, such as the mean (or median) (input domain)
distance in relation to the average distance of the training dataset, or the mean (or median) deviation (of
the output domain), corresponding to formulas (24), (25), (29), (30), (31), (32) in [51]. Some of them have
parameters ($\gamma$ and $\rho$), which we set to 0.5 (as in [51]).

5 For instance, if we have two regions $\Gamma^{\epsilon}_1 = [3.2, 5.4]$ and $\Gamma^{\epsilon}_2 = [5.3, 15.9]$ for two different examples, we see that the first interval
is much tighter. However, we cannot directly see whether an actual value of 5.2 has higher probability for the first example than a
value of 14.2 for the second, because we cannot directly derive probabilities. A possible way of answering this specific question
is by adjusting the error probability $\epsilon$. If we increase our tolerance, we may get tighter intervals and we may see that some of
the values fall out of the interval. However, we cannot derive the probability for each point either. In order to do this we need a
conditional probability density function.
[Table 16 body: only the rows below survive, without dataset numbers or column headers; one row is illegible.]

0.13  0.62  0.57
0.33  1.00  0.94
[illegible row]
0.16  0.74  0.61
0.52  0.79  0.66
0.47  0.86  0.42
0.32  0.74  0.58
0.57  0.71  0.57
0.24  0.69  0.56
0.44  0.80  0.57
0.34  0.73  0.53
0.35  0.70  0.54
0.19  0.68  0.54

Mean  0.36 0.76 0.61    0.31 0.75 0.60    0.30 0.72 0.59    0.33 0.75 0.59

Table 16: Results (using the datasets in Table 11) for three regression methods for which the variance
estimation has been replaced by the variance given by conformal prediction using the nonconformity measure
A, denoted by LRc, kNNc and Treec. The method Conf keeps the mean which is estimated by conformal
prediction.

We analysed the results for all the nonconformity measures, but we just show the results for the nonconformity
measure A in Table 16, since this measure gives the best results (although the results are relatively
similar for all of them). Compared to the results in Table 2, there seems to be no improvement in the
variance estimation, except for linear regression.

G Comparison between NCDE methods 

At the end of section 3 we select some of the NCDE methods seen in that section. In this appendix,
we include the results for a selection of the most relevant methods: the base techniques' own estimation
(section 3.1), conformal prediction (appendix F), a conditional density estimation (CDE) method
(appendix C), a conditional variance estimation (CVE) method using Tree for residual regression (appendix
D), and three enrichment methods described in section 3.3: RBE (using Tree as residual regression), uKNC
and BIN. For all these methods, Tables 17, 18 and 19 show the comparison for the base techniques LR, kNN
and Tree respectively. We only include the results for the metric msll, which considers both the quality of
the conditional mean and the conditional variance estimation. Note that the results are significant only for
Table 17, so other criteria (such as simplicity) are used to finally select the methods in section 3.

H Proofs 

Here we include the proofs for several results in the paper.

#     LR      LRc     LR CDE    LR CVE Tree    LR ENR Tree    LR ENR uKNC    LR ENR BIN
      msll    msll    msll      msll           msll           msll           msll
1     0.81    0.79    0.83      0.78           0.78           0.78           0.78
2     0.80    0.78    0.72      0.77           0.78           0.78           0.78
3     0.48    0.76    0.50      0.55           0.55           0.57           0.54
4     0.69    0.77    0.88      0.72           0.72           0.77           0.72
5     0.84    0.65    0.68      0.63           0.63           0.63           0.63
6     0.76    0.70    0.64      0.69           0.70           0.71           0.71
7     0.75    0.75    0.75      0.80           0.76           0.75           0.75
8     0.76    0.64    0.63      0.64           0.64           0.64           0.64
9     1.00    0.92    0.97      0.94           0.94           0.73           0.95
10    0.76    0.70    0.59      0.70           0.70           0.67           0.69
11    0.77    0.70    0.72      0.73           0.74           0.72           0.72
12    0.88    0.88    0.76      0.89           0.86           0.85           0.86
13    0.58    0.69    0.77      0.56           0.58           0.72           0.58
14    0.85    0.78    0.74      0.74           0.74           0.74           0.74
15    0.75    0.70    0.80      0.73           0.74           0.75           0.75
16    0.78    0.77    0.81      0.76           0.75           0.74           0.75
17    0.99    0.92    0.78      0.98           0.97           0.91           0.98
18    0.76    0.75    0.82      0.74           0.74           0.74           0.74
19    0.72    0.70    0.76      0.69           0.69           0.69           0.69
20    0.82    0.80    0.73      0.79           0.79           0.79           0.79

Mean  0.78    0.76    0.74      0.74           0.74           0.73           0.74
AR    5.45    4.40    4.00      3.50           4.00           3.30           3.35

Table 17: Results (using the datasets in Table 11) for base technique LR using a selection of the methods
seen in section 3 as described in appendix G. Methods do not perform equally since the Friedman statistic
(14.68) is greater than the critical value (14.16), so the null hypothesis is rejected (significance level: 0.05).
Critical difference for the Nemenyi post-hoc test: 0.6922.



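Proof (for proposition 1). We have that $r^*(x, \ell, \hat{f})$ can be written as follows:

\begin{align*}
r^*(x,\ell,\hat{f}) &= \operatorname*{argmin}_t \int_{-\infty}^{\infty} \ell(t,y)\,\hat{f}(y|x)\,dy \\
&= \operatorname*{argmin}_t \int_{-\infty}^{\infty} \ell(t,\hat{\mu}_{\hat{f}}(x)+s)\,\hat{f}((\hat{\mu}_{\hat{f}}(x)+s)|x)\,ds \\
&= \operatorname*{argmin}_t \int_{0}^{\infty} \ell(t,\hat{\mu}_{\hat{f}}(x)+s)\,\hat{f}((\hat{\mu}_{\hat{f}}(x)+s)|x)\,ds + \int_{-\infty}^{0} \ell(t,\hat{\mu}_{\hat{f}}(x)+s)\,\hat{f}((\hat{\mu}_{\hat{f}}(x)+s)|x)\,ds \\
&= \operatorname*{argmin}_t \int_{0}^{\infty} \ell(t,\hat{\mu}_{\hat{f}}(x)+s)\,\hat{f}((\hat{\mu}_{\hat{f}}(x)+s)|x)\,ds - \int_{0}^{\infty} \ell(t,\hat{\mu}_{\hat{f}}(x)-s)\,\hat{f}((\hat{\mu}_{\hat{f}}(x)-s)|x)\,ds \\
&= \operatorname*{argmin}_t \int_{0}^{\infty} \left[ \ell(t,\hat{\mu}_{\hat{f}}(x)+s)\,\hat{f}((\hat{\mu}_{\hat{f}}(x)+s)|x) - \ell(t,\hat{\mu}_{\hat{f}}(x)-s)\,\hat{f}((\hat{\mu}_{\hat{f}}(x)-s)|x) \right] ds
\end{align*}

Since $\hat{f}$ is symmetric relative to the mean:

$$r^*(x,\ell,\hat{f}) = \operatorname*{argmin}_t \int_{0}^{\infty} \left\{ \ell(t,\hat{\mu}_{\hat{f}}(x)+s) - \ell(t,\hat{\mu}_{\hat{f}}(x)-s) \right\} \hat{f}((\hat{\mu}_{\hat{f}}(x)+s)|x)\,ds$$

But $\ell$ is symmetric, so for every $y$ and $r$ we have that $\ell(y+r, y) = \ell(y-r, y)$ which, jointly
with its commutativity, implies $\ell(y, y-r) = \ell(y, y+r)$, so a minimum of the above expression can be found
when $t = \hat{\mu}_{\hat{f}}(x)$, leading to the expression $\ell(\hat{\mu}_{\hat{f}}(x), \hat{\mu}_{\hat{f}}(x)+s) - \ell(\hat{\mu}_{\hat{f}}(x), \hat{\mu}_{\hat{f}}(x)-s) = 0$. So, $r^*(x,\ell,\hat{f}) = \hat{\mu}_{\hat{f}}(x)$. □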
#     kNN     kNNc    kNN CDE   kNN CVE Tree   kNN ENR Tree   kNN ENR uKNC   kNN ENR BIN
      msll    msll    msll      msll           msll           msll           msll
1     0.80    0.81    0.83      0.80           0.80           0.80           0.80
2     0.75    0.77    0.72      0.75           0.76           0.77           0.77
3     0.61    0.64    0.51      0.58           0.62           0.63           0.64
4     0.81    0.84    0.88      0.81           0.81           0.84           0.81
5     0.63    0.65    0.68      0.64           0.64           0.63           0.63
6     0.65    0.66    0.63      0.65           0.66           0.65           0.66
7     0.76    0.77    0.76      0.76           0.76           0.77           0.77
8     0.61    0.62    0.61      0.61           0.61           0.61           0.61
9     0.99    0.94    0.97      0.99           0.99           0.99           0.99
10    0.59    0.71    0.59      0.66           0.67           0.60           0.60
11    0.74    0.75    0.72      0.76           0.76           0.74           0.75
12    0.77    0.84    0.76      0.81           0.76           0.76           0.75
13    0.89    0.86    0.76      0.91           0.90           0.90           0.89
14    0.72    0.73    0.74      0.71           0.71           0.71           0.71
15    0.73    0.73    0.80      0.73           0.73           0.75           0.75
16    0.72    0.71    0.81      0.71           0.72           0.71           0.71
17    0.78    0.79    0.77      0.78           0.83           0.80           0.80
18    0.75    0.76    0.82      0.75           0.75           0.75           0.75
19    0.70    0.73    0.76      0.71           0.70           0.70           0.70
20    [gap]   0.68    0.73      0.69           0.69           0.68           0.68

Mean  0.73    0.75    0.74      0.74           0.74           0.74           0.74
AR    3.40    4.90    3.90      4.00           3.85           3.95           4.00

Table 18: Results (using the datasets in Table 11) for base technique kNN using a selection of the methods
seen in section 3 as described in appendix G. Methods may perform equally since the Friedman statistic
(5.164) is lower than the critical value (14.16), so the null hypothesis cannot be rejected (significance level:
0.05). Critical difference for the Nemenyi post-hoc test: 0.6922.



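Proof (for proposition 4). We use the expression for $\ell^A_\alpha(t, y)$ and decompose it depending on whether $t < y$
or not.

\begin{align*}
\mathcal{L}(x,t,\hat{f},\ell^A_\alpha) &= \int_{-\infty}^{\infty} \ell^A_\alpha(t,y)\,\hat{f}(y|x)\,dy \\
&= \int_{-\infty}^{t} (1-\alpha)(t-y)\,\hat{f}(y|x)\,dy + \int_{t}^{\infty} \alpha(y-t)\,\hat{f}(y|x)\,dy \\
&= (1-\alpha)\,t\,\hat{F}(t|x) - \int_{-\infty}^{t} (1-\alpha)\,y\,\hat{f}(y|x)\,dy + \int_{t}^{\infty} \alpha\,y\,\hat{f}(y|x)\,dy - \int_{t}^{\infty} \alpha\,t\,\hat{f}(y|x)\,dy \\
&= (1-\alpha)\,t\,\hat{F}(t|x) - \int_{-\infty}^{t} (1-\alpha)\,y\,\hat{f}(y|x)\,dy + \int_{t}^{\infty} \alpha\,y\,\hat{f}(y|x)\,dy - \alpha t\,(1-\hat{F}(t|x)) \\
&= (1-\alpha)\,t\,\hat{F}(t|x) - \int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy + \int_{-\infty}^{\infty} \alpha\,y\,\hat{f}(y|x)\,dy - \alpha t\,(1-\hat{F}(t|x)) \\
&= (1-\alpha)\,t\,\hat{F}(t|x) - \int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy + \alpha\hat{\mu}(x) - \alpha t\,(1-\hat{F}(t|x)) \\
&= \alpha\hat{\mu}(x) + t\,\hat{F}(t|x) - \alpha t - \int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy
\end{align*}

□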

#     Tree    Treec   Tree CDE  Tree CVE Tree  Tree ENR Tree  Tree ENR uKNC  Tree ENR BIN
      msll    msll    msll      msll           msll           msll           msll
1     0.85    0.83    0.83      0.83           0.85           0.84           0.84
2     0.66    0.68    0.72      0.68           0.66           0.68           0.66
3     0.62    0.67    0.51      0.62           0.60           0.64           0.63
4     0.80    0.79    0.88      0.81           0.81           0.77           0.81
5     0.63    0.64    0.68      0.63           0.63           0.63           0.63
6     0.63    0.62    0.63      0.62           0.62           0.64           0.65
7     0.76    0.76    0.74      0.77           0.76           0.75           0.76
8     0.58    0.59    0.62      0.59           0.58           0.59           0.59
9     0.99    0.97    0.97      1.00           1.00           0.93           1.00
10    0.55    0.61    0.58      0.57           0.57           0.56           0.56
11    0.68    0.67    0.72      0.67           0.68           0.68           0.68
12    0.71    0.77    0.76      0.72           0.72           0.70           0.74
13    0.89    0.85    0.77      0.91           0.90           0.89           0.90
14    0.65    0.70    0.74      0.66           0.65           0.65           0.65
15    0.72    0.71    0.80      0.72           0.72           0.74           0.73
16    0.69    0.67    0.81      0.68           0.68           0.69           0.68
17    0.76    0.76    0.77      0.76           0.76           0.77           0.76
18    0.74    0.74    0.82      0.75           0.74           0.74           0.74
19    0.69    0.70    0.76      0.70           0.69           0.70           0.69
20    0.69    0.68    0.73      0.69           0.69           0.69           0.68

Mean  0.71    0.72    0.74      0.72           0.72           0.71           0.72
AR    3.45    3.40    5.30      4.20           3.70           3.95           4.00

Table 19: Results (using the datasets in Table 11) for base technique Tree using a selection of the methods
seen in section 3 as described in appendix G. Methods may perform equally since the Friedman statistic
(10.65) is lower than the critical value (14.16), so the null hypothesis cannot be rejected (significance level:
0.05). Critical difference for the Nemenyi post-hoc test: 0.6922.

Proof (for proposition 5). From proposition 4 we just derive the expression for minimising the expected
loss:

$$r^*(x,\ell^A_\alpha,\hat{f}) = \operatorname*{argmin}_t \left\{ \alpha\hat{\mu}(x) + t\,\hat{F}(t|x) - \alpha t - \int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy \right\}$$

In order to find the minimum, we calculate the first derivative with respect to $t$ and set it to 0:

\begin{align*}
\hat{F}(t|x) + t\,\hat{f}(t|x) - \alpha - t\,\hat{f}(t|x) &= 0 \\
\hat{F}(t|x) &= \alpha
\end{align*}

Since the second derivative, $\hat{f}(t|x)$, is positive, this is a minimum. □
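The condition $\hat{F}(t|x) = \alpha$ says that the optimal prediction is the $\alpha$-quantile of the estimated conditional distribution. A minimal numerical check, assuming a normal conditional density and using only Python's standard library, is sketched below; the parameter values and the helper names (`asym_abs_loss`, `expected_loss`) are illustrative assumptions.

```python
from statistics import NormalDist

def asym_abs_loss(t, y, alpha):
    """(1 - alpha)*(t - y) when over-predicting, alpha*(y - t) when under-predicting."""
    return (1 - alpha) * (t - y) if y < t else alpha * (y - t)

def expected_loss(t, dist, alpha, n=4000, span=8.0):
    """Midpoint-rule approximation of the expected loss under dist."""
    lo = dist.mean - span * dist.stdev
    h = 2 * span * dist.stdev / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        total += asym_abs_loss(t, y, alpha) * dist.pdf(y)
    return total * h

dist = NormalDist(mu=10.0, sigma=2.0)   # hypothetical soft prediction for one instance
alpha = 0.8
t_star = dist.inv_cdf(alpha)            # solves F(t|x) = alpha
```

Evaluating `expected_loss` slightly above and below `t_star` gives larger values than at `t_star` itself, as the proposition predicts.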

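Proof (for proposition 6). We follow the same initial steps as in the absolute case. We derive the expected
loss (eq. 2) and decompose the expression for $\ell^S_\alpha(t, y)$ depending on whether $t < y$ or not.

\begin{align*}
\mathcal{L}(x,t,\hat{f},\ell^S_\alpha) &= \int_{-\infty}^{\infty} \ell^S_\alpha(t,y)\,\hat{f}(y|x)\,dy \\
&= \int_{-\infty}^{t} (1-\alpha)(t-y)^2\,\hat{f}(y|x)\,dy + \int_{t}^{\infty} \alpha(y-t)^2\,\hat{f}(y|x)\,dy \\
&= \int_{-\infty}^{t} (1-\alpha)\,t^2\,\hat{f}(y|x)\,dy + \int_{-\infty}^{t} (1-\alpha)(-2ty+y^2)\,\hat{f}(y|x)\,dy \\
&\quad + \int_{t}^{\infty} \alpha\,t^2\,\hat{f}(y|x)\,dy + \int_{t}^{\infty} \alpha(-2ty+y^2)\,\hat{f}(y|x)\,dy \\
&= (1-\alpha)\,t^2\,\hat{F}(t|x) + \alpha t^2\,(1-\hat{F}(t|x)) \\
&\quad + \int_{-\infty}^{t} (1-\alpha)(-2ty+y^2)\,\hat{f}(y|x)\,dy + \int_{t}^{\infty} \alpha(-2ty+y^2)\,\hat{f}(y|x)\,dy \\
&= (1-2\alpha)\,t^2\,\hat{F}(t|x) + \alpha t^2 + \int_{-\infty}^{t} (1-2\alpha)(-2ty+y^2)\,\hat{f}(y|x)\,dy - 2\alpha t\hat{\mu}(x) + \alpha\hat{\mu}_2(x) \\
&= (1-2\alpha)\left[ t^2\,\hat{F}(t|x) - 2t\int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy + \int_{-\infty}^{t} y^2\,\hat{f}(y|x)\,dy \right] + \alpha\left[ t^2 - 2t\hat{\mu}(x) + \hat{\mu}_2(x) \right]
\end{align*}

where $\hat{\mu}_2(x)$ is the second raw moment of $\hat{f}(y|x)$. □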
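Proof (for proposition 7). From proposition 6, we have:

\begin{align*}
r^*(x,\ell^S_\alpha,\hat{f}) &= \operatorname*{argmin}_t\ \mathcal{L}(x,t,\hat{f},\ell^S_\alpha) \\
&= \operatorname*{argmin}_t \left\{ (1-2\alpha)\left[ t^2\,\hat{F}(t|x) - 2t\int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy + \int_{-\infty}^{t} y^2\,\hat{f}(y|x)\,dy \right] + \alpha\left[ t^2 - 2t\hat{\mu}(x) + \hat{\mu}_2(x) \right] \right\}
\end{align*}

Again, in order to find the minimum, we calculate the first derivative with respect to $t$ and set it to 0:

\begin{align*}
(1-2\alpha)\left[ t^2\hat{f}(t|x) + 2t\,\hat{F}(t|x) - 2\int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy - 2t \cdot t\hat{f}(t|x) + t^2\hat{f}(t|x) \right] + 2\alpha t - 2\alpha\hat{\mu}(x) + 0 &= 0 \\
(1-2\alpha)\left[ 2t\,\hat{F}(t|x) - 2\int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy \right] + 2\alpha t - 2\alpha\hat{\mu}(x) &= 0
\end{align*}

The second derivative is:

$$(1-2\alpha)\left[ 2\hat{F}(t|x) + 2t\hat{f}(t|x) - 2t\hat{f}(t|x) \right] + 2\alpha = (1-2\alpha)\,2\hat{F}(t|x) + 2\alpha$$

which is never negative since both $\hat{F}(t|x)$ and $\alpha$ are between 0 and 1. Consequently, we have a minimum. □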

Proof (for proposition 8). Assuming $\hat{f}(t|x)$ is a normal distribution, we can standardise $\hat{f}(t|x)$ as $\phi(t')$ with
$t' = \frac{t - \hat{\mu}(x)}{\hat{\sigma}(x)}$. Then, proposition 7 reduces to:

$$(1-2\alpha)\left[ 2t'\Phi(t') - 2\int_{-\infty}^{t'} y\,\phi(y)\,dy \right] + 2\alpha t' - 0 = 0$$

The partial (from $-\infty$ to $t'$) first moment of the standard normal distribution is just $-\phi(t')$. This can also
be seen as a truncated standard normal distribution whose expected value is $E(w | w < t') = -\frac{\phi(t')}{\Phi(t')}$. Since the
truncated standard normal distribution is normalised by $\Phi(t')$ we get $-\phi(t')$. This can also be obtained by
just solving the integral. From here,

$$(1-2\alpha)\left[ 2t'\Phi(t') + 2\phi(t') \right] + 2\alpha t' = 0$$

where $t$ is obtained using $t = \hat{\sigma}(x)\,t' + \hat{\mu}(x)$. □
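The final equation has no closed-form solution for $t'$, but its left-hand side is increasing in $t'$ for $0 < \alpha < 1$ (its derivative is the second derivative shown in the proof of proposition 7), so a simple bisection finds the root. The sketch below is one possible way to do this with the standard library; the search interval, the iteration count and the function names are assumptions of this example.

```python
from statistics import NormalDist

STD = NormalDist()   # standard normal: Phi = STD.cdf, phi = STD.pdf

def condition(tp, alpha):
    """Left-hand side of the first-order condition above, divided by 2."""
    return (1 - 2 * alpha) * (tp * STD.cdf(tp) + STD.pdf(tp)) + alpha * tp

def solve_t_prime(alpha, lo=-10.0, hi=10.0, iters=200):
    """Bisection on t'; valid for 0 < alpha < 1, where the condition is increasing."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if condition(mid, alpha) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def optimal_prediction(mu, sigma, alpha):
    """Map the standardised solution back with t = sigma * t' + mu."""
    return sigma * solve_t_prime(alpha) + mu
```

For example, `optimal_prediction(10.0, 2.0, 0.5)` returns the mean, while values of `alpha` above 0.5 (under-prediction is more costly) shift the prediction above the mean. Since $t'$ depends only on $\alpha$, it can be computed once per operating context and reused for every instance.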
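Proof (for proposition 9). We start from the expression of the expected loss (proposition 4):

$$\mathcal{L}(x,t,\hat{f},\ell^A_\alpha) = -\int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy + \alpha\hat{\mu}(x) + t\,\hat{F}(t|x) - \alpha t$$

In order to reduce $\int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy$, we see that it is a partial moment of the normal distribution. This is equal to
an unnormalised version of the expected value of a truncated distribution, which is $E(X | X < T) = \mu - \sigma\frac{\phi(t')}{\Phi(t')}$
with $t' = \frac{T - \mu}{\sigma}$. Consequently, this term reduces to $\Phi(t')\hat{\mu}(x) - \hat{\sigma}(x)\phi(t')$ with $t' = \frac{t - \hat{\mu}(x)}{\hat{\sigma}(x)}$ ($t = \hat{\sigma}(x)\,t' + \hat{\mu}(x)$).
So, we have:

\begin{align*}
\mathcal{L}(x,t,\hat{f},\ell^A_\alpha) &= \alpha\hat{\mu}(x) + t\,\Phi(t') - \alpha t - \Phi(t')\hat{\mu}(x) + \hat{\sigma}(x)\phi(t') \\
&= (t - \hat{\mu}(x))\,\Phi(t') + \hat{\sigma}(x)\,\phi(t') - \alpha\,(t - \hat{\mu}(x)) \\
&= (t - \hat{\mu}(x))\left[ \Phi\!\left(\frac{t - \hat{\mu}(x)}{\hat{\sigma}(x)}\right) - \alpha \right] + \hat{\sigma}(x)\,\phi\!\left(\frac{t - \hat{\mu}(x)}{\hat{\sigma}(x)}\right)
\end{align*}

□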
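The closed form $(t - \hat{\mu}(x))(\Phi(t') - \alpha) + \hat{\sigma}(x)\phi(t')$, which is just a regrouping of the expression derived in this proof, can be checked against a direct numerical integration of the asymmetric absolute loss. The sketch below assumes a normal conditional density; the function names, the integration grid and the parameter values used to exercise it are illustrative choices.

```python
from statistics import NormalDist

def expected_asym_abs_loss(t, mu, sigma, alpha):
    """(t - mu) * (Phi(t') - alpha) + sigma * phi(t'), with t' = (t - mu) / sigma."""
    std = NormalDist()
    tp = (t - mu) / sigma
    return (t - mu) * (std.cdf(tp) - alpha) + sigma * std.pdf(tp)

def numeric_expected_loss(t, mu, sigma, alpha, n=20000, span=10.0):
    """Midpoint-rule integral of the asymmetric absolute loss times the normal density."""
    dist = NormalDist(mu, sigma)
    lo, h = mu - span * sigma, 2 * span * sigma / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        loss = (1 - alpha) * (t - y) if y < t else alpha * (y - t)
        total += loss * dist.pdf(y)
    return total * h

closed = expected_asym_abs_loss(11.5, 10.0, 2.0, 0.8)
approx = numeric_expected_loss(11.5, 10.0, 2.0, 0.8)
```

Having the expected loss in closed form is what makes instance-dependent reframing cheap: it only needs the two parameters of the soft prediction and evaluations of $\Phi$ and $\phi$.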



Proof (for proposition 10). We start from the expression of the expected loss (proposition 6):

$$\mathcal{L}(x,t,\hat{f},\ell^S_\alpha) = (1-2\alpha)\left[ t^2\,\hat{F}(t|x) - 2t\int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy + \int_{-\infty}^{t} y^2\,\hat{f}(y|x)\,dy \right] + \alpha\left[ t^2 - 2t\hat{\mu}(x) + \hat{\mu}_2(x) \right] \qquad (17)$$

We reduce $\int_{-\infty}^{t} y\,\hat{f}(y|x)\,dy$ as we did in the proof of proposition 9, as a partial moment of the normal
distribution, to $\Phi(t')\hat{\mu}(x) - \hat{\sigma}(x)\phi(t')$ with $t' = \frac{t - \hat{\mu}(x)}{\hat{\sigma}(x)}$ ($t = \hat{\sigma}(x)\,t' + \hat{\mu}(x)$).

We reduce $\int_{-\infty}^{t} y^2\,\hat{f}(y|x)\,dy$ as a partial second order moment of the normal distribution (or a full second
order moment of the truncated normal distribution), to $\Phi(t')\left( \hat{\mu}(x)^2 - 2\hat{\mu}(x)\hat{\sigma}(x)\frac{\phi(t')}{\Phi(t')} + \hat{\sigma}(x)^2\left(1 - t'\frac{\phi(t')}{\Phi(t')}\right) \right)$.
The last term $\hat{\mu}_2(x)$ is the second order moment of the normal distribution, which is just $\hat{\mu}(x)^2 + \hat{\sigma}(x)^2$.
Plugging all this into (17), and using the short notation $\mu$ for $\hat{\mu}(x)$ and $\sigma$ for $\hat{\sigma}(x)$, we have:

\begin{align*}
\mathcal{L}(x,t,\hat{f},\ell^S_\alpha) &= (1-2\alpha)\left[ t^2\Phi(t') - 2t\left( \Phi(t')\mu - \sigma\phi(t') \right) + \Phi(t')\left( \mu^2 - 2\mu\sigma\frac{\phi(t')}{\Phi(t')} + \sigma^2\left(1 - t'\frac{\phi(t')}{\Phi(t')}\right) \right) \right] \\
&\quad + \alpha\left[ t^2 - 2t\mu + (\mu^2 + \sigma^2) \right] \\
&= \Phi(t')(1-2\alpha)\left[ t^2 - 2t(\mu - \sigma q(t')) + \left( \mu^2 - 2\mu\sigma q(t') + \sigma^2(1 - t'q(t')) \right) \right] + \alpha\left( (t-\mu)^2 + \sigma^2 \right) \\
&= \Phi(t')(1-2\alpha)\left[ (t'\sigma + \mu)^2 - 2(t'\sigma + \mu)(\mu - \sigma q(t')) + \left( \mu^2 - 2\mu\sigma q(t') + \sigma^2(1 - t'q(t')) \right) \right] + \alpha\left( (t'\sigma)^2 + \sigma^2 \right) \\
&= \Phi(t')(1-2\alpha)\left[ (t'\sigma)^2 + 2t'\sigma^2 q(t') - t'\sigma^2 q(t') + \sigma^2 \right] + \alpha\sigma^2(t'^2 + 1) \\
&= \Phi(t')(1-2\alpha)\,\sigma^2\left[ t'^2 + t'q(t') + 1 \right] + \alpha\sigma^2(t'^2 + 1)
\end{align*}

with $q(t') = \frac{\phi(t')}{\Phi(t')}$. □
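The partial moments of the normal distribution used in this proof can be checked numerically by plugging them into the expression from proposition 6 and comparing the result with a direct integration of the asymmetric squared loss. The following sketch is an illustration under stated assumptions (a normal conditional density, a midpoint-rule integrator, made-up parameter values); the function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()

def partial_moments(t, mu, sigma):
    """Integrals of f, y*f and y^2*f from -inf to t, for a normal density f."""
    tp = (t - mu) / sigma
    big, small = STD.cdf(tp), STD.pdf(tp)
    m0 = big
    m1 = big * mu - sigma * small
    m2 = big * (mu ** 2 + sigma ** 2) - sigma * small * (2 * mu + sigma * tp)
    return m0, m1, m2

def expected_asym_sq_loss(t, mu, sigma, alpha):
    """Expression from proposition 6 with the normal partial moments plugged in."""
    m0, m1, m2 = partial_moments(t, mu, sigma)
    mu2 = mu ** 2 + sigma ** 2                      # second raw moment
    return ((1 - 2 * alpha) * (t * t * m0 - 2 * t * m1 + m2)
            + alpha * (t * t - 2 * t * mu + mu2))

def numeric_expected_loss(t, mu, sigma, alpha, n=20000, span=10.0):
    """Midpoint-rule integral of the asymmetric squared loss times the normal density."""
    dist = NormalDist(mu, sigma)
    lo, h = mu - span * sigma, 2 * span * sigma / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        w = (1 - alpha) if y < t else alpha
        total += w * (t - y) ** 2 * dist.pdf(y)
    return total * h

closed = expected_asym_sq_loss(11.5, 10.0, 2.0, 0.8)
approx = numeric_expected_loss(11.5, 10.0, 2.0, 0.8)
```

For $\alpha = 0.5$ the loss is half the ordinary squared error, so the expected loss at $t = \hat{\mu}(x)$ should equal $\hat{\sigma}(x)^2 / 2$, which gives a quick extra sanity check.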